Discussion:
[OT] More details about the BA data centre disaster
Simon Clubley
2017-06-02 21:21:56 UTC
Since it's already been discussed here, I thought I would post this
as more details have emerged about what went wrong:

https://www.theregister.co.uk/2017/06/02/british_airways_data_centre_configuration/

The whole article is a very interesting read so I won't quote segments
apart from this one because it directly relates to what VMS can do:

|Bill Francis, Head of Group IT at BA's owner International Airlines
|Group (IAG), has sent an email to staff saying an investigation so far
|had found that an Uninterruptible Power Supply to a core data centre
|at Heathrow was over-ridden on Saturday morning. He said: "This
|resulted in the total immediate loss of power to the facility,
|bypassing the backup generators and batteries. This in turn meant that
|the controlled contingency migration to other facilities could not be
|applied. "After a few minutes of this shutdown of power, it was turned
|back on in an unplanned and uncontrolled fashion, which created
|physical damage to the system, and significantly exacerbated the
|problem.

What I don't understand, however, is why operations didn't simply fail
over to the backup data centre automatically on a total power failure.
This has been a solved problem in the VMS world for decades, and from
what I have read about IBM's geographically distributed sysplex
capabilities, IBM have similar capabilities, so why were BA not using
them?

Based on the above email, why on earth does a failing data centre need
to be kept alive long enough for a "controlled contingency migration"
to be carried out?

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world
David Froble
2017-06-02 21:41:08 UTC
Post by Simon Clubley
Since it's already been discussed here, I thought I would post this
https://www.theregister.co.uk/2017/06/02/british_airways_data_centre_configuration/
The whole article is a very interesting read so I won't quote segments
|Bill Francis, Head of Group IT at BA's owner International Airlines
|Group (IAG), has sent an email to staff saying an investigation so far
|had found that an Uninterruptible Power Supply to a core data centre
|at Heathrow was over-ridden on Saturday morning. He said: "This
|resulted in the total immediate loss of power to the facility,
|bypassing the backup generators and batteries. This in turn meant that
|the controlled contingency migration to other facilities could not be
|applied. "After a few minutes of this shutdown of power, it was turned
|back on in an unplanned and uncontrolled fashion, which created
|physical damage to the system, and significantly exacerbated the
|problem.
What I don't understand, however, is why operations didn't simply fail
over to the backup data centre automatically on a total power failure.
This has been a solved problem in the VMS world for decades, and from
what I have read about IBM's geographically distributed sysplex
capabilities, IBM have similar capabilities, so why were BA not using
them?
Based on the above email, why on earth does a failing data centre need
to be kept alive long enough for a "controlled contingency migration"
to be carried out?
Simon.
Well, pure speculation, but it sounds to me as if the people there didn't know
how to do a recovery. Lack of training? Lack of ??????
Kerry Main
2017-06-03 00:43:32 UTC
-----Original Message-----
David Froble via Info-vax
Sent: June 2, 2017 5:41 PM
Subject: Re: [Info-vax] [OT] More details about the BA data centre disaster
Post by Simon Clubley
What I don't understand, however, is why operations didn't simply fail
over to the backup data centre automatically on a total power failure.
[...]
Based on the above email, why on earth does a failing data centre need
to be kept alive long enough for a "controlled contingency migration"
to be carried out?
Simon.
Well, pure speculation, but it sounds to me as if the people there
didn't know how to do a recovery. Lack of training? Lack of ??????
Lack of experience with mission-critical systems support and proper DR
planning, using solutions that are not easily implemented in a
multi-site fashion.

Ask any customer who has MS Exchange (even recent versions) what their
DR solution is and whether they have ever tested it.

What if a plane crash, a terrorist attack, or a sudden dam burst up
river took out the entire primary DC, i.e. the primary no longer
exists?

For this scenario, there is no "controlled contingency migration"
possible.

DR is like insurance - you can scrimp with low-end solutions and save
dollars hoping you will never have to actually use it, but if something
really bad happens, you are going to regret not having the proper
coverage.

Just ask BA.


Regards,

Kerry Main
Kerry dot main at starkgaming dot com
Kerry Main
2017-06-03 14:08:17 UTC
-----Original Message-----
Sent: June 2, 2017 8:44 PM
Subject: RE: [Info-vax] [OT] More details about the BA data centre disaster
[...]
DR is like insurance - you can scrimp with low-end solutions and save
dollars hoping you will never have to actually use it, but if something
really bad happens, you are going to regret not having the proper
coverage.
Just ask BA.
Regards,
Kerry Main
Kerry dot main at starkgaming dot com
Came across the following BA article on the BBC web site (creative
finger pointing by IT execs - "not my problem"):

<http://www.bbc.com/news/technology-40132540>
"Last weekend's catastrophic failure in BA's computer system threw the
travel plans of 75,000 passengers into chaos. What went wrong has become
a little clearer - it appears the power somehow went off at a Heathrow
data centre and when it was switched back on a power surge somehow took
out the whole system.

Airline bosses insist that this means the whole incident was a power
failure not an IT failure - but experts point out that power management
is an essential element of any well-planned IT system.

Bert Craven of the consultancy T2RL, who has designed systems for major
airlines, tells us the real question is whether the airline had what he
calls geo-redundancy."

[see url for rest of article]


Regards,

Kerry Main
Kerry dot main at starkgaming dot com
Scott Dorsey
2017-06-07 12:51:14 UTC
Post by Simon Clubley
Based on the above email, why on earth does a failing data centre need
to be kept alive long enough for a "controlled contingency migration"
to be carried out ?
In order to make sure that all transactions are closed out and that the
backup data center state perfectly matches the state of the failing center.

In most cases, the backup data center is not simultaneously processing
the transactions during normal use along with the primary. That sort of
configuration is done when the highest reliability is needed, but it means
that transactions are much slower because the total transaction time depends
on the longer of the two turnaround times to the two data centers. So it is
more common for the backup center to just mirror itself from the main center
periodically.

This means that the two data centers are always in slightly different
states. The difference might be only a few seconds and a few dozen
transactions, but that is still reason to flush everything and make
sure the backup is completely mirrored in the last minutes before the
failing center goes down.
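To make the trade-off concrete, here is a minimal Python sketch, a toy
model only: the site names, latencies and transactions are invented and
have nothing to do with the actual BA or airline systems. It contrasts
synchronous commits, where the acknowledgement waits on both data
centers, with periodic asynchronous mirroring, where commits are fast
but anything not yet mirrored is lost if the primary suddenly dies.

import time


class Site:
    """Toy stand-in for a data center that can commit transactions."""

    def __init__(self, name, write_latency_s):
        self.name = name
        self.write_latency_s = write_latency_s
        self.log = []

    def write(self, txn):
        time.sleep(self.write_latency_s)  # simulate round trip + commit
        self.log.append(txn)


def commit_synchronous(txn, primary, backup):
    # Acknowledge only after both sites hold the transaction: zero data
    # loss, but every commit pays the slower site's latency.
    primary.write(txn)
    backup.write(txn)


def commit_async(txn, primary, pending):
    # Acknowledge after the primary write only; the backup catches up
    # later, so anything still in `pending` is lost if the primary dies.
    primary.write(txn)
    pending.append(txn)


def mirror(pending, backup):
    # Periodic mirror job: flush everything outstanding to the backup.
    while pending:
        backup.write(pending.pop(0))


if __name__ == "__main__":
    primary = Site("primary-dc", 0.001)   # hypothetical sites/latencies
    backup = Site("backup-dc", 0.005)

    # Synchronous mode: nothing to lose, but each commit waits on both.
    for i in range(3):
        commit_synchronous("sync-txn-%d" % i, primary, backup)

    # Asynchronous mode: fast commits, divergence until the mirror runs.
    pending = []
    for i in range(10):
        commit_async("txn-%d" % i, primary, pending)
    print("backup behind by", len(pending), "transactions before mirror")
    mirror(pending, backup)
    print("after mirroring, backup holds", len(backup.log), "transactions")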
--scott
--
"C'est un Nagra. C'est suisse, et tres, tres precis."
Simon Clubley
2017-06-07 18:04:06 UTC
Post by Scott Dorsey
Post by Simon Clubley
Based on the above email, why on earth does a failing data centre need
to be kept alive long enough for a "controlled contingency migration"
to be carried out ?
In order to make sure that all transactions are closed out and that the
backup data center state perfectly matches the state of the failing center.
In most cases, the backup data center is not simultaneously processing
the transactions during normal use along with the primary. That sort of
configuration is done when the highest reliability is needed, but it means
that transactions are much slower because the total transaction time depends
on the longer of the two turnaround times to the two data centers. So it is
more common for the backup center to just mirror itself from the main center
periodically.
Personally, I would have thought that an airline data centre does have
a critical reliability requirement, but if the above is the case here,
then clearly the bean counters think otherwise.

It does mean of course that some VMS clusters actually have a higher
disaster fault tolerance than the above mainframe class data centres.
Post by Scott Dorsey
This means that the two data centers are always in slightly different
states. The difference might be only a few seconds and a few dozen
transactions, but that is still reason to flush everything and make
sure the backup is completely mirrored in the last minutes before the
failing center goes down.
--scott
If this is the case, then I don't see how BA could ever have regarded
its backup data centre as being disaster tolerant, as opposed to merely
power failure tolerant (provided of course that in the latter case
no-one pulls the wrong switch. :-))

BTW, I do feel sorry for the person who actually dropped the power and
then apparently panicked. Based on what we know so far, this was far
more a process, procedures and investment problem than it was the fault
of a single contractor.

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world
Scott Dorsey
2017-06-07 19:21:17 UTC
Post by Simon Clubley
Post by Scott Dorsey
Post by Simon Clubley
Based on the above email, why on earth does a failing data centre need
to be kept alive long enough for a "controlled contingency migration"
to be carried out ?
In order to make sure that all transactions are closed out and that the
backup data center state perfectly matches the state of the failing center.
In most cases, the backup data center is not simultaneously processing
the transactions during normal use along with the primary. That sort of
configuration is done when the highest reliability is needed, but it means
that transactions are much slower because the total transaction time depends
on the longer of the two turnaround times to the two data centers. So it is
more common for the backup center to just mirror itself from the main center
periodically.
Personally, I would have thought that an airline data centre does have
a critical reliability requirement, but if the above is the case here,
then clearly the bean counters think otherwise.
No, the airline data center can tolerate an outage of a few seconds or
even a minute for the failover. And it's an application where
performance is paramount, so the SABRE people took the approach of
doing constant updates rather than duplicating transactions.
Post by Simon Clubley
It does mean of course that some VMS clusters actually have a higher
disaster fault tolerance than the above mainframe class data centres.
I am sure. You design fault tolerance around the kinds of faults you expect
and the kinds of outages you can live with. And the money that you have.
Post by Simon Clubley
Post by Scott Dorsey
This means that the two data centers are always in slightly different
states. The difference might be only a few seconds and a few dozen
transactions, but that is still reason to flush everything and make
sure the backup is completely mirrored in the last minutes before the
failing center goes down.
If this is the case, then I don't see how BA could ever have regarded
its backup data centre as being disaster tolerant, as opposed to merely
power failure tolerant (provided of course that in the latter case
no-one pulls the wrong switch. :-))
Likely they considered the normal worst case to be total loss of the
primary data center, with the backup data center picking up the load
after a few seconds and a few confirmed transactions being lost
(because the primary was unable to update the backup in time), and
judged that a reasonable and acceptable risk.

Nobody thought about what would happen if the primary data center was lost and
then some idiot screwed the backup center up too.
Post by Simon Clubley
BTW, I do feel sorry for the person who actually dropped the power and
then apparently panicked. Based on what we know so far, this was far
more a process, procedures and investment problem than it was the fault
of a single contractor.
I'm wondering if this was possibly a case of the system failing over properly
when the power was dropped but then the primary center not properly taking
back over when the power was resumed.
--scott
--
"C'est un Nagra. C'est suisse, et tres, tres precis."
Kerry Main
2017-06-10 17:37:03 UTC
-----Original Message-----
Scott Dorsey via Info-vax
Sent: June 7, 2017 3:21 PM
Subject: Re: [Info-vax] [OT] More details about the BA data centre disaster
Post by Simon Clubley
Post by Scott Dorsey
Post by Simon Clubley
Based on the above email, why on earth does a failing data centre need
to be kept alive long enough for a "controlled contingency migration"
to be carried out?
In order to make sure that all transactions are closed out and that the
backup data center state perfectly matches the state of the failing
center. [...] So it is more common for the backup center to just mirror
itself from the main center periodically.
Personally, I would have thought that an airline data centre does have
a critical reliability requirement, but if the above is the case here,
then clearly the bean counters think otherwise.
"Critical", like beauty, is in the eyes of the beholder.

Service availability is analogous to insurance. How much are you willing
to pay for something you may not ever use?

It's a tough sell to Senior Execs who are not aware of just how much
their business is dependent on IT.

Ask BA today if they think they need additional availability
"insurance".

Hence, this is why the terms RTO (recovery time to be back online) and
RPO (how much data can be lost) were developed.

The lower the RTO/RPO values, the more "insurance" (more costs) you need
to invest in.
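To put rough numbers on that, here is a toy Python sketch of how RPO
and RTO trade off against cost across availability tiers. The tier
names, intervals and relative costs are invented illustration values
only, not vendor or BA figures.

# tier name -> (replication interval s, failover time s, relative cost)
TIERS = {
    "nightly tape restore":    (24 * 3600, 48 * 3600,  1),
    "periodic async mirror":   (300,       1800,       5),
    "continuous async mirror": (5,         300,       20),
    "active-active cluster":   (0,         60,        50),
}


def worst_case(tier):
    interval, failover, cost = TIERS[tier]
    rpo = interval  # data committed since the last replication can be lost
    rto = failover  # time until service is running again at the other site
    return rpo, rto, cost


if __name__ == "__main__":
    for name in TIERS:
        rpo, rto, cost = worst_case(name)
        print("%-25s RPO <= %6ds  RTO <= %6ds  cost x%d"
              % (name, rpo, rto, cost))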
No, the airline data center can tolerate an outage of a few seconds or
even a minute for the failover. And it's an application where
performance is paramount, so the SABRE people took the approach of
doing constant updates rather than duplicating transactions.
[...]
Likely they considered the normal worst case to be total loss of the
primary data center, with the backup data center picking up the load
after a few seconds and a few confirmed transactions being lost
(because the primary was unable to update the backup in time), and
judged that a reasonable and acceptable risk.
Nobody thought about what would happen if the primary data center was
lost and then some idiot screwed the backup center up too.
[...]
I'm wondering if this was possibly a case of the system failing over
properly when the power was dropped but then the primary center not
properly taking back over when the power was resumed.
--scott
Very few companies ever do a full DR test - it is simply too big a cost
in terms of resources, planning, risk, technical challenges, etc. Even
if they do a test, it is usually limited to the critical apps. In this
day and age, that is likely a waste of time, because all of those
smaller feeder apps are usually critical components of the critical
apps.
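To illustrate the point, here is a small Python sketch showing why a DR
test scoped only to the apps labelled "critical" quietly misses the
feeder apps they transitively depend on. The app names and dependency
graph are invented for illustration.

# app -> apps it depends on (hypothetical graph)
DEPENDS_ON = {
    "booking":         ["passenger-db", "fare-feed"],
    "check-in":        ["passenger-db", "baggage-feed"],
    "passenger-db":    [],
    "fare-feed":       ["partner-gateway"],
    "baggage-feed":    [],
    "partner-gateway": [],
}

CRITICAL = {"booking", "check-in"}


def full_test_scope(critical, depends_on):
    # Everything a meaningful DR test must cover: the critical apps plus
    # the transitive closure of their dependencies.
    scope, stack = set(), list(critical)
    while stack:
        app = stack.pop()
        if app not in scope:
            scope.add(app)
            stack.extend(depends_on.get(app, []))
    return scope


if __name__ == "__main__":
    scope = full_test_scope(CRITICAL, DEPENDS_ON)
    print("apps a 'critical only' test covers:", sorted(CRITICAL))
    print("apps the failover actually needs:  ", sorted(scope))
    print("feeder apps missed by the test:    ", sorted(scope - CRITICAL))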

Imho, the service availability compute model of the future is going to
be:
1. Eliminate separate DR planning.
2. Architect DR processes and technologies into the day-to-day
operational processes and flows with virtual DCs, i.e. two sites within
100 km linked with dual links from different providers on different
power grids. From the end user/developer perspective, it looks like one
server and one site.
3. Plan for the worst case. With active-passive solutions, depending on
the timeframe for inter-site replication, one needs to assume some data
will be lost in a significant event where a primary site is suddenly
gone.
4. With active-passive, Service A primary is offered via site1 with
Service A backup via site2 for the first quarter. In subsequent
quarters, the primary and backup sites are reversed with a planned
switch (see the sketch below). Hence, each quarter, you are constantly
exercising your "DR" solution via normal operations processes.
5. While more expensive, an active-active solution (a load-balanced,
multi-site cluster) is the right choice where the RPO is zero (no data
loss) and the RTO can be a few minutes. This is often (not always) the
case with banks and stock exchanges.
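As a rough illustration of point 4 above, here is a toy Python sketch
of rotating the primary role between two sites each quarter so the
failover path gets exercised as part of normal operations. The site
names and switch steps are hypothetical placeholders, not a real
runbook.

from datetime import date

SITES = ["site1", "site2"]  # two DCs within ~100 km, per point 2 above


def primary_for(today):
    # Alternate the primary role every calendar quarter.
    quarter = (today.month - 1) // 3  # 0..3
    return SITES[quarter % 2]


def planned_switch(today):
    new_primary = primary_for(today)
    new_backup = [s for s in SITES if s != new_primary][0]
    # In a real environment these steps would be change-managed runbooks:
    print("quiesce writes on", new_backup)
    print("confirm replication is caught up on", new_primary)
    print("redirect service endpoints to", new_primary)
    print(new_backup, "now receives replication as the standby")


if __name__ == "__main__":
    for d in (date(2017, 1, 2), date(2017, 4, 3),
              date(2017, 7, 3), date(2017, 10, 2)):
        print("-- quarter starting %s: primary is %s"
              % (d.isoformat(), primary_for(d)))
        planned_switch(d)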

Case in point - as I recall, there was one financial company that lost
no transactions, and clients did not even know one of the company's A-A
sites was in one of the towers on that tragic day, 9/11.


Regards,

Kerry Main
Kerry dot main at starkgaming dot com
Michael Moroney
2017-06-10 19:05:17 UTC
Post by Kerry Main
Case in point - as I recall, there was one financial company that lost
no transactions, and clients did not even know one of the company's A-A
sites was in one of the towers on that tragic day, 9/11.
You may be thinking of Cantor Fitzgerald, who were in the top of Tower
One above the impact zone. Unfortunately, they lost more than half of
their employees, but they didn't lose a single trade. I don't know if
their A-A system was VMS or not. There were a few VMSclusters with one
site in the WTC at the time.
Kerry Main
2017-06-10 21:35:19 UTC
-----Original Message-----
Michael Moroney via Info-vax
Sent: June 10, 2017 3:05 PM
Subject: Re: [Info-vax] [OT] More details about the BA data centre disaster
Post by Kerry Main
Case in point - as I recall, there was one financial company that lost
no transactions, and clients did not even know one of the company's A-A
sites was in one of the towers on that tragic day, 9/11.
You may be thinking of Cantor Fitzgerald, who were in the top of Tower
One above the impact zone. Unfortunately, they lost more than half of
their employees, but they didn't lose a single trade. I don't know if
their A-A system was VMS or not. There were a few VMSclusters with one
site in the WTC at the time.
There are some key differentiators between disaster recovery and
disaster tolerance.

Disaster tolerance (DT) implies there is no impact on the business when
a site goes away. This is the area where an active-active (A-A)
multi-site OpenVMS cluster plays.

Disaster recovery (DR) means the business is impacted, but can recover
at some point. This is what most other platforms focus on as they have a
tougher time dealing with A-A solutions. Btw, OpenVMS can be configured
as A-P as well.

DR vs. DT (a mid-2003 article which applies today just as much as it
did back then):
<http://www.informationweek.com/disaster-recovery-versus-disaster-prevention/d/d-id/1020115>
" Analyst Jim Johnson of the Standish Group said enterprises rely too
extensively on recovery. "If you have to recover, you've lost time and
money." He says most companies are more cognizant of business continuity
but budget cuts have forced many to pull back. Says Johnson: "The goal
still needs to be an infrastructure so automatic that the time to
recover is zero."

Commerzbank on 9/11 is a good example of disaster tolerance (DT) using
OpenVMS:
<http://www.availabilitydigest.com/public_articles/0407/commerzbank.pdf>
"However, Commerzbank had the foresight to distribute its
processing with an OpenVMS Active/Active Split-Site Cluster. Its
processing services continued uninterrupted in its alternate
location thirty miles away, allowing it to continue to provide
seamless service following the terrorist attack"

Cantor Fitzgerald is a good example of disaster recovery (DR).
<http://www.eweek.com/storage/espeed-lifts-cantor-fitzgerald>
<http://siteselection.com/ssinsider/bbdeal/bd040802.htm>

Keith Parris (most here know Keith) has kept up with this area and has
delivered likely the best DR availability sessions you can get these
days - on any platform.

A few of Keith's many whitepapers and papers:

Whitepaper (must reading if you are getting into this area):
<http://h20565.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c04622808>

Home site - tons of multi-site, DR/DT material:
<https://sites.google.com/site/keithparris>


Regards,

Kerry Main
Kerry dot main at starkgaming dot com
