Discussion:
What happens when your sysadmins don't have VMS style discipline...
Simon Clubley
2017-02-01 13:50:09 UTC
The following is _well_ worth a read:

https://www.theregister.co.uk/2017/02/01/gitlab_data_loss/

Basically, someone made a mistake and deleted the production database.
While bad, that happens. What shouldn't happen, however, is that their
various multiple backup options all had problems when they tried to
use them to restore the lost data.

There's a very detailed list of all the screwups in the above article.
I recommend you are not eating or drinking anything while reading it.

That /VERIFY qualifier is in VMS backup for a very good reason. It would
be nice if all the fashion of the month backup systems (and the current
crop of sysadmins) understood this as well.
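
For anyone who hasn't seen it, the idea is roughly this (device and
save-set names invented here; see HELP BACKUP for the full qualifier list):

$ ! Image backup of a disk to tape; /VERIFY makes BACKUP re-read what it
$ ! just wrote and compare it against the source data.
$ BACKUP/IMAGE/VERIFY/LOG DKA100: MKA600:NIGHTLY.BCK/REWIND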

Some people think that sysadmins should have less detailed knowledge
of the systems under their control and it should be more of a push-button
type system management. If any of those people work for GitLab, I wonder
if they still think that ?

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world
abrsvc
2017-02-01 13:57:56 UTC
Without even reading the article, I can tell you that many clients that I have had over the years are faithful in creating backups. Sometimes even multiple backups...

However, they rarely (if ever) attempt a restore. Having backups is a good thing, but not verifying that the backups are good or that a restore will work is irresponsible. I'm willing to bet that the number of sites that attempt restore verification can be counted on 1 or 2 hands.
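
A minimal check doesn't take much, something along these lines (names
invented, qualifiers from memory): periodically restore the save set onto a
scratch disk and make sure the result is actually usable.

$ ! Prove the save set really restores, onto a scratch disk.
$ MOUNT/FOREIGN DKA200:
$ BACKUP/IMAGE/VERIFY MKA600:FULL.BCK/REWIND DKA200:
$ ! ...then mount the restored volume properly and eyeball the contents.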

Dan
IanD
2017-02-01 14:47:55 UTC
Post by abrsvc
Without even reading the article, I can tell you that many clients that I have had over the years are faithful in creating backups. Sometimes even multiple backups...
However, they rarely (if ever) attempt a restore. Having backups is a good thing, but not verifying that the backups are good or that a restore will work is irresponsible. I'm willing to bet that the number of sites that attempt restore verification can be counted on 1 or 2 hands.
Dan
One place I worked in did an rdb restore straight after the backup to prove that the backup was viable
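
From memory it was little more than this (database and file names invented
here, and the restore was actually redirected to a scratch area rather than
over the live root - qualifiers omitted):

$ ! Back the database up, then immediately prove the save file is usable.
$ RMU/BACKUP PAYROLL_DB PAYROLL_BACKUP.RBF
$ RMU/RESTORE PAYROLL_BACKUP.RBF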

Sadly after a number of years, some bean-counter decided that the extra effort, disk space etc was not worth it and canned the whole process

After that, somewhere along the line the same bean counter got the smart notion that the VMS camp would be forced over to use data protector for all backups and to stop individual OS backups on all platforms

Wind the clock forward, throw in a db corruption, and then when the backups were called upon, DP failed to get the most recent copy back despite the backup status being listed on the cell server as valid

An older version was found, restored and journals mostly restored (some of these could not be restored either!)

Days of data were lost which had to be manually reworked into the system, taking weeks to do, delaying all sorts of downstream activities and impacting customers

The craziest thing is the lesson was not learnt and the same backup system was kept in place because going back to the original method was deemed as not following the company directive for backups!

Those folks putting out those glossy brochures telling everyone just how fantastic their backup systems are still get listened to - versus those who have a restore strategy - it seems making people believe they are covered by backups alone sells better than getting them to take the pessimistic, restore-first view of things
Ian Miller
2017-02-01 16:19:52 UTC
Post by IanD
Post by abrsvc
Without even reading the article, I can tell you that many clients that I have had over the years are faithful in creating backups. Sometimes even multiple backups...
However, they rarely (if ever) attempt a restore. Having backups is a good thing, but not verifying that the backups are good or that a restore will work is irresponsible. I'm willing to bet that the number of sites that attempt restore verification can be counted on 1 or 2 hands.
Dan
One place I worked in did an rdb restore straight after the backup to prove that the backup was viable
Sadly after a number of years, some bean-counter decided that the extra effort, disk space etc was not worth it and canned the whole process
After that, somewhere along the line the same bean counter got the smart notion that the VMS camp would be forced over to use data protector for all backups and to stop individual OS backups on all platforms
Wind the clock forward, throw in a db corruption, and then when the backups were called upon, DP failed to get the most recent copy back despite the backup status being listed on the cell server as valid
An older version was found, restored and journals mostly restored (some of these could not be restored either!)
Days of data were lost which had to be manually reworked into the system, taking weeks to do, delaying all sorts of downstream activities and impacting customers
The craziest thing is the lesson was not learnt and the same backup system was kept in place because going back to the original method was deemed as not following the company directive for backups!
Those folks putting out those glossy brochures telling everyone just how fantastic their backup systems are still get listened to - versus those who have a restore strategy - it seems making people believe they are covered by backups alone sells better than getting them to take the pessimistic, restore-first view of things
I always say systems should have a restore strategy not a backup strategy, and that should be part of an overall data lifecycle strategy.

The data on a system is often the most valuable part of a system and may be one of the most valuable assets a company has - it is difficult to convince beancounters of this.
V***@SendSpamHere.ORG
2017-02-01 17:22:32 UTC
Post by Ian Miller
Post by IanD
Post by abrsvc
Without even reading the article, I can tell you that many clients that I have had over the years are faithful in creating backups. Sometimes even multiple backups...
However, they rarely (if ever) attempt a restore. Having backups is a good thing, but not verifying that the backups are good or that a restore will work is irresponsible. I'm willing to bet that the number of sites that attempt restore verification can be counted on 1 or 2 hands.
Dan
One place I worked in did an rdb restore straight after the backup to prove that the backup was viable
Sadly after a number of years, some bean-counter decided that the extra effort, disk space etc was not worth it and canned the whole process
After that, somewhere along the line the same bean counter got the smart notion that the VMS camp would be forced over to use data protector for all backups and to stop individual OS backups on all platforms
Wind the clock forward, throw in a db corruption, and then when the backups were called upon, DP failed to get the most recent copy back despite the backup status being listed on the cell server as valid
An older version was found, restored and journals mostly restored (some of these could not be restored either!)
Days of data were lost which had to be manually reworked into the system, taking weeks to do, delaying all sorts of downstream activities and impacting customers
The craziest thing is the lesson was not learnt and the same backup system was kept in place because going back to the original method was deemed as not following the company directive for backups!
Those folks putting out those glossy brochures telling everyone just how fantastic their backup systems are still get listened to - versus those who have a restore strategy - it seems making people believe they are covered by backups alone sells better than getting them to take the pessimistic, restore-first view of things
I always say systems should have a restore strategy not a backup strategy, and that should be part of an overall data lifecycle strategy.
The data on a system is often the most valuable part of a system and may be one of the most valuable assets a company has - it is difficult to convince beancounters of this.
Data *IS* thee IT asset; the rest of IT is liability.
--
VAXman- A Bored Certified VMS Kernel Mode Hacker VAXman(at)TMESIS(dot)ORG

I speak to machines with the voice of humanity.
David Froble
2017-02-01 17:34:01 UTC
Post by IanD
Post by abrsvc
Without even reading the article, I can tell you that many clients that I have had over the years are faithful in creating backups. Sometimes even multiple backups...
However, they rarely (if ever) attempt a restore. Having backups is a good thing, but not verifying that the backups are good or that a restore will work is irresponsible. I'm willing to bet that the number of sites that attempt restore verification can be counted on 1 or 2 hands.
Dan
One place I worked in did an rdb restore straight after the backup to prove that the backup was viable
Sadly after a number of years, some bean-counter decided that the extra effort, disk space etc was not worth it and canned the whole process
After that, somewhere along the line the same bean counter got the smart notion that the VMS camp would be forced over to use data protector for all backups and to stop individual OS backups on all platforms
Wind the clock forward, throw in a db corruption, and then when the backups were called upon, DP failed to get the most recent copy back despite the backup status being listed on the cell server as valid
An older version was found, restored and journals mostly restored (some of these could not be restored either!)
Days of data were lost which had to be manually reworked into the system, taking weeks to do, delaying all sorts of downstream activities and impacting customers
The craziest thing is the lesson was not learnt and the same backup system was kept in place because going back to the original method was deemed as not following the company directive for backups!
Those folks putting out those glossy brochures telling everyone just how fantastic their backup systems are still get listened to - versus those who have a restore strategy - it seems making people believe they are covered by backups alone sells better than getting them to take the pessimistic, restore-first view of things
Bean counters ....

If there is ever to be an apocalypse, it will most likely be caused by some bean
counter.

And let me guess, in the above example, the bean counter was not considered the
problem, was he?
Rich Jordan
2017-02-01 17:40:03 UTC
Post by IanD
Post by abrsvc
Without even reading the article, I can tell you that many clients that I have had over the years are faithful in creating backups. Sometimes even multiple backups...
However, they rarely (if ever) attempt a restore. Having backups is a good thing, but not verifying that the backups are good or that a restore will work is irresponsible. I'm willing to bet that the number of sites that attempt restore verification can be counted on 1 or 2 hands.
Dan
One place I worked in did an rdb restore straight after the backup to prove that the backup was viable
Sadly after a number of years, some bean-counter decided that the extra effort, disk space etc was not worth it and canned the whole process
After that, somewhere along the line the same bean counter got the smart notion that the VMS camp would be forced over to use data protector for all backups and to stop individual OS backups on all platforms
Wind the clock forward, throw in a db corruption, and then when the backups were called upon, DP failed to get the most recent copy back despite the backup status being listed on the cell server as valid
An older version was found, restored and journals mostly restored (some of these could not be restored either!)
Days of data were lost which had to be manually reworked into the system, taking weeks to do, delaying all sorts of downstream activities and impacting customers
The craziest thing is the lesson was not learnt and the same backup system was kept in place because going back to the original method was deemed as not following the company directive for backups!
Those folks putting out those glossy brochures telling everyone just how fantastic their backup systems are still get listened to - versus those who have a restore strategy - it seems making people believe they are covered by backups alone sells better than getting them to take the pessimistic, restore-first view of things
I presume said beancounter was promoted and is now a highly placed executive and some poor IT bastards got scapegoated.
Phillip Helbig (undress to reply)
2017-02-02 08:43:33 UTC
Post by abrsvc
Without even reading the article, I can tell you that many clients that I
have had over the years are faithful in creating backups. Sometimes even
multiple backups...
Yes. At least two copies, stored at at least two sites.
Post by abrsvc
However, they rarely (if ever) attempt a restore. Having backups is a good
thing, but not verifying that the backups are good or that a restore will
work is irresponsible. I'm willing to bet that the number of sites that
attempt restore verification can be counted on 1 or 2 hands.
A good point. The restore should at least be tested. Not necessarily
for every backup, but at least every time one changes anything in the
strategy.
Baldrick
2017-02-02 12:55:49 UTC
In article <>,
Post by abrsvc
Without even reading the article, I can tell you that many clients that I
have had over the years are faithful in creating backups. Sometimes even
multiple backups...
Yes. At least two copies, stored at at least two sites.
Post by abrsvc
However, they rarely (if ever) attempt a restore. Having backups is a good
thing, but not verifying that the backups are good or that a restore will
work is irresponsible. I'm willing to bet that the number of sites that
attempt restore verification can be counted on 1 or 2 hands.
A good point. The restore should at least be tested. Not necessarily
for every backup, but at least every time one changes anything in the
strategy.
This was covered in my "You can't make this [stuff] up" talk at the bootcamp: how easy it can be to delete production databases, and then the multitude of ways backups can fail to be restored, or even to be created in the first place - all real stories.

What puzzles me, however [in the story], is that there is some assumption that "moving to the cloud" is going to be any better. Why do so many have this illusion that it's some magical place where data is secure and never lost?

Remember folks, volume shadowing is NOT a substitute for BACKUP; HBVS is just a very convenient and fast way of replicating all your mistakes - files deleted or corrupted in the blink of an eye...

Baldrick
u***@gmail.com
2017-02-04 18:06:09 UTC
Post by Simon Clubley
https://www.theregister.co.uk/2017/02/01/gitlab_data_loss/
Basically, someone made a mistake and deleted the production database.
While bad, that happens. What shouldn't happen, however, is that their
various multiple backup options all had problems when they tried to
use them to restore the lost data.
There's a very detailed list of all the screwups in the above article.
I recommend you are not eating or drinking anything while reading it.
That /VERIFY qualifier is in VMS backup for a very good reason. It would
be nice if all the fashion of the month backup systems (and the current
crop of sysadmins) understood this as well.
Some people think that sysadmins should have less detailed knowledge
of the systems under their control and it should be more of a push-button
type system management. If any of those people work for GitLab, I wonder
if they still think that ?
Simon.
--
Microsoft: Bringing you 1980s technology to a 21st century world
if you run 3 shadow disks and a cluster, dismount one for an image then remount back into the shadow set, you will never have problems.
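
Roughly this sort of cycle (device names and volume label invented here;
minicopy and the exact MOUNT qualifiers depend on your setup):

$ ! Drop one member out of shadow set DSA1:
$ DISMOUNT $1$DGA101:
$ ! Mount the ex-member privately, read-only, and take the image backup.
$ MOUNT/OVERRIDE=SHADOW_MEMBERSHIP/NOWRITE $1$DGA101: DATA1
$ BACKUP/IMAGE/VERIFY $1$DGA101: MKA600:DSA1_IMAGE.BCK/REWIND
$ DISMOUNT $1$DGA101:
$ ! Put it back; a copy (or minicopy) brings it up to date.
$ MOUNT/SYSTEM DSA1: /SHADOW=($1$DGA101:) DATA1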
o***@gmail.com
2017-02-04 18:27:15 UTC
Post by u***@gmail.com
if you run 3 shadow disks and a cluster, dismount one for an image then remount back into the shadow set, you will never have problems.
For disaster recovery of your system disk, that's great. The most common case of recovery, though, is accidentally deleted user files where retrieval from a nightly incremental (especially if staged to disk before archiving on tape) is more practical.
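
E.g. the usual arrangement, give or take the exact qualifiers (names
invented here): a periodic full image backup plus nightly incrementals
staged to disk, from which a single deleted file can be pulled back quickly.

$ BACKUP/IMAGE/RECORD/VERIFY DKA100: STAGE$DISK:[SAVESETS]FULL.BCK/SAVE_SET
$ BACKUP/RECORD/SINCE=BACKUP DKA100:[000000...]*.*;* -
        STAGE$DISK:[SAVESETS]INCR.BCK/SAVE_SET
$ ! Retrieving one accidentally deleted file from the incremental:
$ BACKUP STAGE$DISK:[SAVESETS]INCR.BCK/SAVE_SET/SELECT=[SMITH]REPORT.TXT DKA100:[SMITH]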
David Froble
2017-02-04 19:08:45 UTC
Post by u***@gmail.com
Post by Simon Clubley
https://www.theregister.co.uk/2017/02/01/gitlab_data_loss/
Basically, someone made a mistake and deleted the production database.
While bad, that happens. What shouldn't happen, however, is that their
various multiple backup options all had problems when they tried to
use them to restore the lost data.
There's a very detailed list of all the screwups in the above article.
I recommend you are not eating or drinking anything while reading it.
That /VERIFY qualifier is in VMS backup for a very good reason. It would
be nice if all the fashion of the month backup systems (and the current
crop of sysadmins) understood this as well.
Some people think that sysadmins should have less detailed knowledge
of the systems under their control and it should be more of a push-button
type system management. If any of those people work for GitLab, I wonder
if they still think that ?
Simon.
--
Microsoft: Bringing you 1980s technology to a 21st century world
if you run 3 shadow disks and a cluster, dismount one for an image then remount back into the shadow set, you will never have problems.
I am still counting the ways I could screw that up. It'll take a while yet for
me to stop counting ....
Phillip Helbig (undress to reply)
2017-02-04 20:37:15 UTC
Post by u***@gmail.com
if you run 3 shadow disks and a cluster, dismount one for an image
then remount back into the shadow set, you will never have problems.
Not necessarily. If files are open for write, you can't be guaranteed a
clean snapshot.
Michael Moroney
2017-02-04 21:24:09 UTC
Post by Phillip Helbig (undress to reply)
Post by u***@gmail.com
if you run 3 shadow disks and a cluster, dismount one for an image
then remount back into the shadow set, you will never have problems.
Not necessarily. If files are open for write, you can't be guaranteed a
clean snapshot.
It will be as clean (or not clean) as if a system crash or power fail took
place at the point it was removed. Actually somewhat better, as shadowing
will at least quiesce the set (be sure all writes are complete or not
started) when the shadowing state is changed.

The proper way is to ensure all applications using the shadow set are
themselves quiesced (no files open for write) when removing the member.
abrsvc
2017-02-04 22:13:30 UTC
Post by Michael Moroney
The proper way is to ensure all applications using the shadow set are
themselves quiesced (no files open for write) when removing the member.
To add to what Michael M posted:

The usual metric used for backups is "down time" - in other words, the time when an application is not available for use. Using the shadow member removal method, the period of unavailability is minimized. This method has been well received when proposed, as it significantly reduces application interruption times.

Dan
Phillip Helbig (undress to reply)
2017-02-05 07:28:47 UTC
Post by Michael Moroney
Post by Phillip Helbig (undress to reply)
Post by u***@gmail.com
if you run 3 shadow disks and a cluster, dismount one for an image
then remount back into the shadow set, you will never have problems.
Not necessarily. If files are open for write, you can't be guaranteed a
clean snapshot.
It will be as clean (or not clean) as if a system crash or power fail took
place at the point it was removed. Actually somewhat better, as shadowing
will at least quiesce the set (be sure all writes are complete or not
started) when the shadowing state is changed.
The proper way is to ensure all applications using the shadow set are
themselves quiesced (no files open for write) when removing the member.
True. However, this might mean shutting down several applications. It
also doesn't guard against users accidentally deleting files. And, of
course, the shadow-set members should be at different locations.
Stephen Hoffman
2017-02-06 16:21:56 UTC
Post by Michael Moroney
Post by Phillip Helbig (undress to reply)
Post by u***@gmail.com
if you run 3 shadow disks and a cluster, dismount one for an image then
remount back into the shadow set, you will never have problems.
Not necessarily. If files are open for write, you can't be guaranteed
a clean snapshot.
It will be as clean (or not clean) as if a system crash or power fail
took place at the point it was removed. Actually somewhat better, as
shadowing will at least quiesce the set (be sure all writes are
complete or not started) when the shadowing state is changed.
I wouldn't want to be recovering from crashes or BACKUP
/IGNORE=INTERLOCK backups as a routine part of my purpose-built backup
strategy, as — having dealt with the recovery from both — it gets ugly.

OpenVMS itself doesn't care about what parts of OpenVMS itself get
stomped on, so the operating system near-always reboots and works after
a crash, modulo some disk storage that can require an extra step to
free and some logging data right around the crash that might get lost.
Applications can be rather more sensitive around what's written, or
not. And yes, crashes happen, but hopefully there's a good and
consistent backup ahead of that when recovery is needed.
Post by Michael Moroney
The proper way is to ensure all applications using the shadow set are
themselves quiesced (no files open for write) when removing the member.
Ayup. It's also common for apps to use multiple shadowsets.
There's no mechanism within OpenVMS to allow cooperating dismounts or
coordinated backups across them, though adding one would require work in
both the apps and OpenVMS.
--
Pure Personal Opinion | HoffmanLabs LLC
Stephen Hoffman
2017-02-06 16:06:01 UTC
Post by u***@gmail.com
if you run 3 shadow disks and a cluster, dismount one for an image then
remount back into the shadow set, you will never have problems.
Sure, if the dismounts always happen at the boundaries of all active
transactions targeting that and all the other disks involved (which
isn't happening), or if the apps can be quiesced and caches flushed.

Which means... No, that approach isn't reliable. Not unless I can
quiesce the apps. Then it does work.

BTW, the shadowset limit has been six for a while now, not three.
--
Pure Personal Opinion | HoffmanLabs LLC