Discussion:
Transient anal/disk errors
tadamsmar
2013-12-12 14:39:04 UTC
I have been running ANAL/DISK a lot more lately on our systems.

One of them gives transient warnings like this pretty often:

%ANALDISK-W-BADHEADER, file (41450,2154,0)
invalid file header
-ANALDISK-I-HEADER_DEL, deleted file header
-ANALDISK-I-DELHEADER_BUSY, deleted file header marked "busy"
in index file bitmap
%ANALDISK-W-ALLOCCLR, blocks incorrectly marked allocated
LBN 1435455 to 1435489, RVN 1
%ANALDISK-W-BADDIRENT, invalid file identification in directory entry
[CELEES]ALARM.TMP;1
-ANALDISK-I-BAD_DIRHEADER, no valid file header for directory
%ANALDISK-W-FREESPADRIFT, free block count of 11099270 is incorrect (RVN 1);
the correct value is 11099305

If I run it again, the messages go away or change. This is a shadowed disk that is not logging errors.

The other 4 systems run ANAL/DISK clean; if they have transient warnings at all, it must be at a much lower rate. All 5 systems are V7.3-2. The one with the transients is a DS10 466 MHz. The others are DS10s (466 or 600 MHz) and one is an AS800. They are all running in essentially the same operating system configuration. Maybe there is more application-level activity on the one with the transients; I'm not sure.

PS: They all give this informational, and always have:
%ANALDISK-I-OPENQUOTA, error opening QUOTA.SYS
-SYSTEM-W-NOSUCHFILE, no such file

How does one get rid of the OPENQUOTA statement?

Thanks in advance for any input.
abrsvc
2013-12-12 14:47:25 UTC
Post by tadamsmar
%ANALDISK-I-OPENQUOTA, error opening QUOTA.SYS
-SYSTEM-W-NOSUCHFILE, no such file
How does one get rid of the OPENQUOTA statement?
Thanks in advance for any input.
This message indicates that disk quotas are not enabled. You can easily eliminate it by enabling disk quotas with extremely high values. The high values will ensure that quotas are not really enforced, and will avoid the message. Please note that it is informational only and harms nothing. The quota file only exists when disk quotas are used.

Dan
Jan-Erik Soderholm
2013-12-12 15:03:45 UTC
%ANALDISK-W-BADHEADER, file (41450,2154,0) invalid file header....
If I run it again, the messages go away or change...
Maybe there is more application
level activity on the one with the transients, not sure.
I don't know that either.
But if they were *my* systems, I would check.
Stephen Hoffman
2013-12-12 15:07:28 UTC
You're looking at a live disk, with active changes, and not a static
environment. That'll inherently generate diagnostics on an ODS-2 or
ODS-5 file system.

If you really want to pursue this, write some tools to scan for severe
and fatal errors, and mask the expected errors.
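
A minimal sketch of that sort of filter, in Python rather than DCL, assuming the ANALYZE /DISK output has been captured to text first. It relies only on the standard VMS message layout (%FACILITY-S-IDENT, text, where S is one of S, I, W, E, F). The mask list and the function names are hypothetical; pick your own.

```python
import re

# VMS messages look like "%FACILITY-S-IDENT, text"; linked messages start
# with "-" instead of "%". S is the severity letter: S, I, W, E or F.
MESSAGE_RE = re.compile(
    r"^[%-](?P<facility>[A-Z0-9_$]+)-(?P<severity>[SIWEF])-"
    r"(?P<ident>[A-Z0-9_$]+),\s*(?P<text>.*)$"
)

# Hypothetical mask list: idents considered expected on these disks.
EXPECTED_IDENTS = {"OPENQUOTA", "NOSUCHFILE"}
# Severities worth an alert: E (error) and F (fatal, i.e. severe error).
ALERT_SEVERITIES = {"E", "F"}


def parse_messages(lines):
    """Turn captured output lines into a list of message dicts."""
    messages = []
    for raw in lines:
        line = raw.strip()
        match = MESSAGE_RE.match(line)
        if match:
            messages.append(match.groupdict())
        elif messages and line:
            # Unprefixed lines continue the text of the previous message.
            messages[-1]["text"] += " " + line
    return messages


def alerts(lines, expected=EXPECTED_IDENTS, severities=ALERT_SEVERITIES):
    """Return the messages that are severe enough and not masked."""
    return [
        m for m in parse_messages(lines)
        if m["severity"] in severities and m["ident"] not in expected
    ]


if __name__ == "__main__":
    sample = [
        "%ANALDISK-W-BADHEADER, file (41450,2154,0)",
        "invalid file header",
        "%ANALDISK-I-OPENQUOTA, error opening QUOTA.SYS",
        "-SYSTEM-W-NOSUCHFILE, no such file",
    ]
    print(alerts(sample))  # prints [] because nothing here is E or F
```

Everything quoted in this thread is W or I severity, so a filter like this stays quiet on it. Whether warnings should also alert is a policy call; add "W" to the severity set if so.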
Post by tadamsmar
If I run it again, the messages go away or change. This is a shadowed
disk that is not logging errors.
Those are the typical sorts of chatter that arise with an active disk.

So have you finished working on your backup strategy, and have you
recently tested a recovery-restart from that? (This is vastly more
important than analyzing your disks, as ANALYZE /DISK is reactive and
doesn't spot impending failures, and RAID doesn't protect against
various common errors, including volume corruptions larger than what it
can handle, file deletions, or any sort of intentional theft,
corruption or damage that might occur. The BACKUPs allow for recovery.)

There's also the infamous BACKUP /IGNORE=INTERLOCK command, which some
folks think is an online BACKUP. It's not. Worse, it allows silent
data corruptions in the output savesets. If you have control over the
applications involved, that's where the BACKUP support needs to reside,
particularly if your applications are writing clumps of updates to
disk. Various relational databases on VMS include application-internal
backup tools, and always use those in preference to using the OpenVMS
BACKUP command. Alternatively, quiesce the applications or the disks
or the systems, and then use the standard BACKUP tools. Or quiesce the
environment and yank a disk from the shadowset, and backup that.
That's a much smaller window of downtime. Test the recovery process
periodically.
Post by tadamsmar
The other 4 systems run ANAL/DISK clean, if they have transient
warnings at all then it must be at a much lower rate.
I'd look to replace all of the disks in this configuration, just
because most of them are probably as old as those boxes. As good as
the old DEC SCSI disks were, statistically, they're failure fodder
given their likely relative ages.
Post by tadamsmar
All 5 systems are V7.3-2.
Ancient.
Post by tadamsmar
The one with the transients is a DS10 466 MHz.
Shut it down, boot from CD or a backup disk, and try again. Quiesce
the environment, in other words.
Post by tadamsmar
The others are DS10s 466 or 600 and one is an AS800.
The arsenal of the ancient, eh? One rx2660 would likely easily replace
this whole configuration. Less power, less space, more capacity.
Maybe two with a low-end FC SAN or shared SCSI connection for the boot
and quorum disk, if you're in an uptime-critical environment.
Post by tadamsmar
They all are running in essentially the same operating system configuration.
"Essentially" is a particularly loaded word when debugging stuff. It's
those "essential" differences that often play into differences in how
bugs manifest themselves.
Post by tadamsmar
Maybe there is more application level activity on the one with the
transients, not sure.
Re-read the above listing of transients. There's your evidence.
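
To spell out the arithmetic in that listing: the ALLOCCLR range and the FREESPADRIFT difference are the same size. A quick check, assuming the LBN range is inclusive (the variable names are just for illustration):

```python
# Figures copied from the ANALYZE /DISK listing at the top of the thread.
alloc_first_lbn = 1435455    # ALLOCCLR: first LBN marked allocated
alloc_last_lbn = 1435489     # ALLOCCLR: last LBN marked allocated
reported_free = 11099270     # FREESPADRIFT: free block count on disk
correct_free = 11099305      # FREESPADRIFT: value ANALYZE computed

# Assumption: the LBN range is inclusive at both ends.
blocks_marked_allocated = alloc_last_lbn - alloc_first_lbn + 1
free_count_drift = correct_free - reported_free

print(blocks_marked_allocated, free_count_drift)  # prints: 35 35
```

Both come to 35 blocks. That is consistent with (though not proof of) a single small file, such as the ALARM.TMP in the BADDIRENT message, being deleted while the scan was in progress.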
Post by tadamsmar
%ANALDISK-I-OPENQUOTA, error opening QUOTA.SYS
-SYSTEM-W-NOSUCHFILE, no such file
How does one get rid of the OPENQUOTA statement?
Activate the disk quotas on the disk, rebuild, and set the limits past
the capacity of the disk, and take a slight performance hit tracking
the quotas. Or do what everybody else does here, and ignore it.
--
Pure Personal Opinion | HoffmanLabs LLC
tadamsmar
2013-12-12 15:51:16 UTC
Post by Stephen Hoffman
You're looking at a live disk, with active changes, and not a static
environment. That'll inherently generate diagnostics on an ODS-2 or
ODS-5 file system.
If you really want to pursue this, write some tools to scan for severe
and fatal errors, and mask the expected errors.
Post by tadamsmar
If I run it again, the messages go away or change. This is a shadowed
disk that is not logging errors.
Those are the typical sorts of chatter that arise with an active disk.
So have you finished working on your backup strategy, and have you
recently tested a recovery-restart from that? (This is vastly more
important than analyzing your disks, as ANALYZE /DISK is reactive and
doesn't spot impending failures, and RAID doesn't protect against
various common errors, including volume corruptions larger than what it
can handle, file deletions, or any sort of intentional theft,
corruption or damage that might occur. The BACKUPs allow for recovery.)
There's also the infamous BACKUP /IGNORE=INTERLOCK command, which some
folks think is an online BACKUP. It's not. Worse, it allows silent
data corruptions in the output savesets. If you have control over the
applications involved, that's where the BACKUP support needs to reside,
particularly if your applications are writing clumps of updates to
disk. Various relational databases on VMS include application-internal
backup tools, and always use those in preference to using the OpenVMS
BACKUP command. Alternatively, quiesce the applications or the disks
or the systems, and then use the standard BACKUP tools. Or quiesce the
environment and yank a disk from the shadowset, and backup that.
You think I was recently working on my backup strategy? I was just working
on those persistent ANAL/DISK problems.

But I probably do need to work on my backup strategy. I have been yanking out a disk without quiescing and backing up the yanked disk, and I have not done any deliberate recovery testing, just de facto testing when I had to recover a file or compress a disk. Just yanking a disk is easy, since I only have to run command procedures, but as you point out, it might not have optimal reliability.

What's the easiest way to quiesce and yank? The only way I am sure of is to shut down, boot with a CD, yank, then reboot normally. I am not sure that there is a console command that will yank a disk from a shadowset, but I seem to recall one that will disable shadowing.

I have noticed that sometimes a yanked disk will not run ANAL/DISK clean. This also seems to be transient.
Post by Stephen Hoffman
That's a much smaller window of downtime. Test the recovery process
periodically.
Post by tadamsmar
The other 4 systems run ANAL/DISK clean, if they have transient
warnings at all then it must be at a much lower rate.
I'd look to replace all of the disks in this configuration, just
because most of them are probably as old as those boxes. As good as
the old DEC SCSI disks were, statistically, they're failure fodder
given their likely relative ages.
Post by tadamsmar
All 5 systems are V7.3-2.
Ancient.
Me have no wampum for support for many moons, paleface.
Post by Stephen Hoffman
Post by tadamsmar
The one with the transients is a DS10 466 MHz.
Shut it down, boot from CD or a backup disk, and try again. Quiesce
the environment, in other words.
Post by tadamsmar
The others are DS10s 466 or 600 and one is an AS800.
The arsenal of the ancient, eh? One rx2660 would likely easily replace
this whole configuration.
Heck, one DS10 600 could probably replace the whole thing.

There was this idea that running on 4 systems made a total failure less likely, so we spread it out over 4 systems plus a development system that could also act as a quickly configurable spare. But this was kind of pointless. Someone once brought almost all of it down by yanking on one thin wire, which was THE thinwire. Now we have thickwire with one switch, which revolutionized bringing down the system: it can be done remotely without yanking a cable. Or by one dead UPS battery, or by unplugging the switch. All or most of these have happened.
Post by Stephen Hoffman
Less power, less space, more capacity.
Maybe two with a low-end FC SAN or shared SCSI connection for the boot
and quorum disk, if you're in an uptime-critical environment.
Post by tadamsmar
They all are running in essentially the same operating system configuration.
"Essentially" is a particularly loaded word when debugging stuff. It's
those "essential" differences that often play into differences in how
bugs manifest themselves.
Post by tadamsmar
Maybe there is more application level activity on the one with the
transients, not sure.
Re-read the above listing of transients. There's your evidence.
Yes, that one system was arguably more active today when I was doing my testing.
Post by Stephen Hoffman
Post by tadamsmar
%ANALDISK-I-OPENQUOTA, error opening QUOTA.SYS
-SYSTEM-W-NOSUCHFILE, no such file
How does one get rid of the OPENQUOTA statement?
Activate the disk quotas on the disk, rebuild, and set the limits past
the capacity of the disk, and take a slight performance hit tracking
the quotas. Or do what everybody else does here, and ignore it.
Stephen Hoffman
2013-12-12 16:31:44 UTC
Post by tadamsmar
Post by Stephen Hoffman
There's also the infamous BACKUP /IGNORE=INTERLOCK command, which some
folks think is an online BACKUP. It's not. Worse, it allows silent
data corruptions in the output savesets. If you have control over the
applications involved, that's where the BACKUP support needs to reside,
particularly if your applications are writing clumps of updates to
disk. Various relational databases on VMS include application-internal
backup tools, and always use those in preference to using the OpenVMS
BACKUP command. Alternatively, quiesce the applications or the disks
or the systems, and then use the standard BACKUP tools. Or quiesce the
environment and yank a disk from the shadowset, and backup that.
You think I was recently working on my backup strategy? I was just
working on those persistent ANAL/DISK problems.
Yes, so was I. In my admittedly skewed view of the world,
investigations of persistent disk errors are always secondary to having
good and verified backups. Preserve the most current data first, then
study the disks and the errors.
Post by tadamsmar
But I probably do need to work on my backup strategy. I have been
yanking out a disk without quiescing and backing up the yanked disk,
and I have not done any deliberate recovery testing, just de facto when
I had to recover a file or compress a disk. Just yanking a disk is
easy, I just have to run command procedures, but as you point out, it
might not have optimal reliability.
You're hot-plugging active disks, and probably in an environment
without a quiesce function on the storage controller?

Don't do that.

You've probably been causing some of the errors and corruptions here.
Post by tadamsmar
What's the easiest way to quiesce and yank?
Depending on the bus and the target, via the DISMOUNT command. Some
storage controllers support a quiesce function, and others expect you
to shut down. I'm guessing your gear probably lacks one of those
controllers; that feature usually only exists on outboard storage
controllers. It's not a feature usually found with host-based JBOD
SCSI controllers, nor even necessarily on some of the host-based SCSI
RAID controllers.

But that's not how I'd do the backups I'm referring to. I'd DISMOUNT
the disk from the shadowset, and MOUNT /NOWRITE the disk privately, and
back up from there. There are minimerge and minicopy bitmaps that were
discussed here in massive detail when Phillip Helbig was trying to
understand how all that worked, so I'm not going to bother reposting
all of that here. Those features will help bring the
temporarily-removed disk back to current within the shadowset more
quickly. Search for threads with minicopy or minimerge or related
keywords via Google Groups, and start reading. Or check the current
volume shadowing manual in the OpenVMS documentation set. Or both.
Post by tadamsmar
The only way I am sure of is to shut down, boot with a CD, yank, then
reboot normally.
That's the best way, official way, and only supported way, if you need
to reconfigure a SCSI bus, and lack a storage controller with a quiesce
function.
Post by tadamsmar
I am not sure that there is a console command that will yank a disk
from a shadowset, but I seem to recall one that will disable shadowing.
Allow me to translate "I don't recall" as "which manual should I read
to learn more about the fundamental operations of the server?". That'd
be the volume shadowing manual. <http://www.hp.com/go/openvms/doc>,
select the VMS documentation shelf, and search for "shadowing", and
skim that manual. You'll definitely need to be more familiar with it
if (when?) you decide to implement minicopy or minimerge. (Though your
VMS version is ancient, and there were definitely various patches made
available in this and related areas of OpenVMS.)
Post by tadamsmar
I have noticed that sometimes a yanked disk will not run ANAL/DISK
clean. This also seems to be transient.
Yeah. Sometimes yanking the disk just silently corrupts the file data
on that disk, depending on the timing. I wouldn't assume other disks
on the SCSI bus would be entirely immune from problems or corruptions,
either. Not without quiescing the bus, or shutting down, or
dismounting the disks on that bus.
--
Pure Personal Opinion | HoffmanLabs LLC
tadamsmar
2013-12-12 17:05:38 UTC
Post by Stephen Hoffman
Post by tadamsmar
Post by Stephen Hoffman
There's also the infamous BACKUP /IGNORE=INTERLOCK command, which some
folks think is an online BACKUP. It's not. Worse, it allows silent
data corruptions in the output savesets. If you have control over the
applications involved, that's where the BACKUP support needs to reside,
particularly if your applications are writing clumps of updates to
disk. Various relational databases on VMS include application-internal
backup tools, and always use those in preference to using the OpenVMS
BACKUP command. Alternatively, quiesce the applications or the disks
or the systems, and then use the standard BACKUP tools. Or quiesce the
environment and yank a disk from the shadowset, and backup that.
You think I was recently working on my backup strategy? I was just
working on those persistent ANAL/DISK problems.
Yes, so was I. In my admittedly skewed view of the world,
investigations of persistent disk errors are always secondary to having
good and verified backups. Preserve the most current data first, then
study the disks and the errors.
Post by tadamsmar
But I probably do need to work on my backup strategy. I have been
yanking out a disk without quiescing and backing up the yanked disk,
and I have not done any deliberate recovery testing, just de facto when
I had to recover a file or compress a disk. Just yanking a disk is
easy, I just have to run command procedures, but as you point out, it
might not have optimal reliability.
You're hot-plugging active disks,
By "yanking" I just meant dismounting a disk from a shadowset. We only have one system, the AS800, that allows literal yanking of a physical disk.
Post by Stephen Hoffman
and probably in an environment
without a quiesce function on the storage controller?
Don't do that.
You've probably been causing some of the errors and corruptions here.
Post by tadamsmar
What's the easiest way to quiesce and yank?
Depending on the bus and the target, via the DISMOUNT command. Some
storage controllers support a quiesce function, and others expect you
to shut down. I'm guessing your gear probably lacks one of those
controllers; that feature usually only exists on outboard storage
controllers. It's not a feature usually found with host-based JBOD
SCSI controllers, nor even necessarily on some of the host-based SCSI
RAID controllers.
Dismount is what I meant by yank.
Post by Stephen Hoffman
But that's not how I'd do the backups I'm referring to. I'd DISMOUNT
the disk from the shadowset, and MOUNT /NOWRITE the disk privately, and
back up from there. There are minimerge and minicopy bitmaps that were
discussed here in massive detail when Phillip Helbig was trying to
understand how all that worked, so I'm not going to bother reposting
all of that here. Those features will help bring the
temporarily-removed disk back to current within the shadowset more
quickly. Search for threads with minicopy or minimerge or related
keywords via Google Groups, and start reading. Or check the current
volume shadowing manual in the OpenVMS documentation set. Or both.
I use /minicopy when I dismount for backing up.
Post by Stephen Hoffman
Post by tadamsmar
The only way I am sure of is to shut down, boot with a CD, yank, then
reboot normally.
That's the best way, official way, and only supported way, if you need
to reconfigure a SCSI bus, and lack a storage controller with a quiesce
function.
Post by tadamsmar
I am not sure that there is a console command that will yank a disk
from a shadowset, but I seem to recall one that will disable shadowing.
Allow me to translate "I don't recall" as "which manual should I read
to learn more about the fundamental operations of the server?". That'd
be the volume shadowing manual. <http://www.hp.com/go/openvms/doc>,
select the VMS documentation shelf, and search for "shadowing", and
skim that manual. You'll definitely need to be more familiar with it
if (when?) you decide to implement minicopy or minimerge. (Though your
VMS version is ancient, and there were definitely various patches made
available in this and related areas of OpenVMS.)
Post by tadamsmar
I have noticed that sometimes a yanked disk will not run ANAL/DISK
clean. This also seems to be transient.
Yeah. Sometimes yanking the disk just silently corrupts the file data
on that disk, depending on the timing. I wouldn't assume other disks
on the SCSI bus would be entirely immune from problems or corruptions,
either. Not without quiescing the bus, or shutting down, or
dismounting the disks on that bus.
I meant dismounting. If I dismount from a shadowset and run anal/disk on the dismounted disk, I sometimes get warnings from anal/disk. I only recently started checking this, as I have gotten a bit more concerned after I had persistent warnings from anal/disk on one system that required a good bit of cleanup.

So perhaps even a dismount from a shadowset is a bit risky for backing up. Of course, it's a heck of a lot easier than the official way to prep a disk for backup.
Stephen Hoffman
2013-12-12 18:51:43 UTC
Post by tadamsmar
If I dismount from a shadowset and run anal/disk on the dismounted
disk, I sometimes get warnings from anal/disk.
If the applications are not quiesced or shut down when the shadowset
member is removed from the shadowset, then some errors and
inconsistencies can arise.
Post by tadamsmar
I just recently started checking this as I have gotten a bit more
concerned after I had persistent warnings from anal/disk on one system
that required a good bit of cleanup.
I'd start by replacing the disks, then a combination of wholesale
server replacement, and a software upgrade. If you have the
prerequisites available, probably with Itanium. You're running on
hardware that I wouldn't recommend to most hobbyists, and you're
spending far too much time keeping it going based on the various
postings with the same problems over and over again, which usually
means that the technical and IT discussions involved here are secondary
to some non-technical or managerial issues (usually funding, but there
can be other triggers), and that's not a fun place to be when you're
trying to keep fossil-grade servers available and operational.
Post by tadamsmar
So perhaps even a dismount from a shadowset is a bit risky for backing
up. Of course, it's a heck of a lot easier than the official way to
prep a disk for backup.
If your disks are dismounted when the applications are quiesced, and
you are getting corruptions, then some combination of hardware and
software are failing you.
--
Pure Personal Opinion | HoffmanLabs LLC
tadamsmar
2013-12-18 19:04:39 UTC
Post by Stephen Hoffman
Post by tadamsmar
If I dismount from a shadowset and run anal/disk on the dismounted
disk, I sometimes get warnings from anal/disk.
If the applications are not quiesced or shut down when the shadowset
member is removed from the shadowset, then some errors and
inconsistencies can arise.
Post by tadamsmar
I just recently started checking this as I have gotten a bit more
concerned after I had persistent warnings from anal/disk on one system
that required a good bit of cleanup.
I'd start by replacing the disks, then a combination of wholesale
server replacement, and a software upgrade.
If you have the
prerequisites available, probably with Itanium.
Prerequisites being willingness to pay a bunch to go from one dead end
legacy system to another.
Post by Stephen Hoffman
You're running on
hardware that I wouldn't recommend to most hobbyists, and you're
spending far too much time keeping it going based on the various
postings with the same problems over and over again,
We have had few problems and have spent little time on them overall.
Post by Stephen Hoffman
which usually
means that the technical and IT discussions involved here are secondary
to some non-technical or managerial issues (usually funding, but there
can be other triggers), and that's not a fun place to be when you're
trying to keep fossil-grade servers available and operational.
Post by tadamsmar
So perhaps even a dismount from a shadowset is a bit risky for backing
up. Of course, it's a heck of a lot easier than the official way to
prep a disk for backup.
If your disks are dismounted when the applications are quiesced, and
you are getting corruptions, then some combination of hardware and
software are failing you.
I am not sure I had corruptions under those circumstances.

The persistent ANAL/DISK warnings arose after I had shifted the disks around between two systems; something might have gone wrong then. Anyway, I am treating it like a one-off special case, but I am monitoring for problems. And I am doing a bunch of stuff to better manage one-off special cases, and to minimize the need to move disks around.

I bought 3 bare-bones systems to put in storage as backups, that will mitigate the need to touch other running systems (to use them as backups) if one system fails.

Also, I am establishing better manual and automatic procedures to check disks for errors and ANAL/DISK warnings during any procedure that involves breaking up a shadowset and moving physical disks. I do a daily routine check for errors, but I need to check repeatedly at various points when I have to do something atypical with a disk.

At least, if I had been more careful and had done more checking while I was making changes, I would know exactly when the persistent ANAL/DISK warnings arose.
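
For the repeated checks, something like the following could separate persistent warnings from transient ones by comparing two captured ANAL/DISK runs. This is only a rough sketch in Python, assuming the output of each run has been saved as text; the function names are made up for illustration, and it treats a message as the same only if its full text (including continuation lines) is identical in both runs.

```python
def group_messages(lines):
    """Join each '%' message with its linked and continuation lines."""
    messages = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("%") or not messages:
            # A '%' prefix starts a new primary message.
            messages.append(line)
        else:
            # '-' linked messages and unprefixed text belong to the
            # message above them.
            messages[-1] += " " + line
    return set(messages)


def classify(first_run, second_run):
    """Split messages into those seen in both runs and those seen in one."""
    first = group_messages(first_run)
    second = group_messages(second_run)
    return {"persistent": first & second, "transient": first ^ second}


if __name__ == "__main__":
    run1 = [
        "%ANALDISK-I-OPENQUOTA, error opening QUOTA.SYS",
        "-SYSTEM-W-NOSUCHFILE, no such file",
        "%ANALDISK-W-BADHEADER, file (41450,2154,0)",
        "invalid file header",
    ]
    run2 = [
        "%ANALDISK-I-OPENQUOTA, error opening QUOTA.SYS",
        "-SYSTEM-W-NOSUCHFILE, no such file",
    ]
    result = classify(run1, run2)
    print(sorted(result["persistent"]))
    print(sorted(result["transient"]))
```

Two runs are not much of a sample on a busy disk, so a message that shows up in both is only a candidate for a closer look, not a confirmed problem.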