Discussion:
Smartarray RAID 1 unit with spare, how check when spare has been consumed?
Rod Regier
2020-09-17 16:34:02 UTC
I've configured a Smartarray RAID 1 unit with a spare disk on an RX2800i6 w/internal P410i (motherboard) controller (see below).

HP docs on the topic are "terse".

What would the status display for the Unit look like after the spare has been "consumed"?

I would like to write code to detect when the spare has been consumed...
(to permit follow-up remedial action).

Here is current status (all good)

Unit 1:
In PDLA mode, Unit 1 is Lun 1.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
2 Data Disk(s) used by lun 1:
Disk 1: Partition 0; (SCSI bus 1, SCSI id 0)
Disk 2: Partition 0; (SCSI bus 1, SCSI id 1)
Spare physical drives:
1 Spare Disk(s) used by lun 1:
Disk 3: (SCSI bus 1, SCSI id 2)
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 232.85 [250.02] GB
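A minimal sketch of the kind of detection code mentioned above — hypothetical, assuming the MSA> SHO UNIT text has been captured into a string; the patterns are taken from the displays shown in this thread, and the function name and heuristic are my own:

```python
import re

def spare_consumed(sho_unit_text):
    """Heuristic check against captured MSA> SHO UNIT output.

    Any 'Disk(s) Failed or Removed:' section means a data drive or the
    spare has failed (the spare line itself may remain unchanged), so
    either way follow-up remedial action is needed.
    """
    failed = re.search(r"(\d+)\s+Disk\(s\) Failed or Removed:", sho_unit_text)
    spares = re.search(r"(\d+)\s+Spare Disk\(s\) used by lun", sho_unit_text)
    n_failed = int(failed.group(1)) if failed else 0
    n_spares = int(spares.group(1)) if spares else 0
    return n_failed > 0 or n_spares == 0

healthy = """\
Volume status : VOLUME OK
2 Data Disk(s) used by lun 1:
1 Spare Disk(s) used by lun 1:
"""
print(spare_consumed(healthy))  # False: no failed section, spare present
```

This only watches for the failed/removed section; matching the specific (bus, id) pairs would take a fuller parse of the display.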
Mark DeArman
2020-09-17 18:00:58 UTC
On Thu, 17 Sep 2020 09:34:02 -0700 (PDT), Rod Regier
Post by Rod Regier
I've configured a Smartarray RAID 1 unit with a spare disk on an RX2800i6 w/internal P410i (motherboard) controller (see below).
HP docs on the topic are "terse".
What would the status display for the Unit look like after the spare has been "consumed"?
I would like to write code to detect when the spare has been consumed...
(To permit followup remedial action).
Here is current status (all good)
In PDLA mode, Unit 1 is Lun 1.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
Disk 1: Partition 0; (SCSI bus 1, SCSI id 0)
Disk 2: Partition 0; (SCSI bus 1, SCSI id 1)
Disk 3: (SCSI bus 1, SCSI id 2)
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 232.85 [250.02] GB
From what I've seen, it won't be listed as a spare anymore and will be
moved into the volume.

Mark
Joukj
2020-09-18 06:30:09 UTC
Post by Mark DeArman
On Thu, 17 Sep 2020 09:34:02 -0700 (PDT), Rod Regier
Post by Rod Regier
I've configured a Smartarray RAID 1 unit with a spare disk on an RX2800i6 w/internal P410i (motherboard) controller (see below).
HP docs on the topic are "terse".
What would the status display for the Unit look like after the spare has been "consumed"?
I would like to write code to detect when the spare has been consumed...
(To permit followup remedial action).
Here is current status (all good)
In PDLA mode, Unit 1 is Lun 1.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
Disk 1: Partition 0; (SCSI bus 1, SCSI id 0)
Disk 2: Partition 0; (SCSI bus 1, SCSI id 1)
Disk 3: (SCSI bus 1, SCSI id 2)
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 232.85 [250.02] GB
From what I've seen, it won't be listed as a spare anymore and will be
moved into the volume.
Mark
And the "show disks" command does not show the disk anymore.
Rod Regier
2020-09-18 12:13:47 UTC
Looks like I need to conduct an experiment, with successive SHO UNIT and SHO DISK displays recorded as I configure and evolve the status of the array + spare.
Rod Regier
2020-09-21 16:09:13 UTC
Test sequence results:

RAID mirror set build with included spare:

MSA> add unit 1 /disk=(000,102)/spare=(001)/RAID=1

Result:

MSA> sho unit

Unit 1:
In PDLA mode, Unit 1 is Lun 1.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
2 Data Disk(s) used by lun 1:
Disk 0: Partition 0; (SCSI bus 0, SCSI id 0)
Disk 102: Partition 0; (SCSI bus 1, SCSI id 2)
Spare physical drives:
1 Spare Disk(s) used by lun 1:
Disk 1: (SCSI bus 0, SCSI id 1)
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 68.48 [73.53] GB

Fail RAID mirror set by removing member mirror drive, result:

Unit 1:
In PDLA mode, Unit 1 is Lun 1.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
1 Disk(s) Failed or Removed:
Disk 0: (SCSI bus 0, SCSI id 0)
2 Data Disk(s) used by lun 1:
Disk 0: Partition 255; (SCSI bus 0, SCSI id 0)
Disk 102: Partition 0; (SCSI bus 1, SCSI id 2)
Spare physical drives:
1 Spare Disk(s) used by lun 1:
Disk 1: (SCSI bus 0, SCSI id 1)
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 68.48 [73.53] GB

\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\

Separately fail spare drive by removal and force detection of result

MSA> scan all

MSA> sho unit

Unit 1:
In PDLA mode, Unit 1 is Lun 1.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
1 Disk(s) Failed or Removed:
Disk 1: (SCSI bus 0, SCSI id 1)
2 Data Disk(s) used by lun 1:
Disk 0: Partition 0; (SCSI bus 0, SCSI id 0)
Disk 102: Partition 0; (SCSI bus 1, SCSI id 2)
Spare physical drives:
1 Spare Disk(s) used by lun 1:
Disk 1: (SCSI bus 0, SCSI id 1)
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 68.48 [73.53] GB
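To connect the failed (bus, id) entries back to their roles automatically, something like the following sketch could work — hypothetical code, assuming the captured SHO UNIT text looks like the displays above:

```python
import re

# Matches lines like "Disk 1: (SCSI bus 0, SCSI id 1)" and captures (bus, id).
DISK = re.compile(r"Disk\s+\d+:.*\(SCSI bus (\d+), SCSI id (\d+)\)")

def classify_failures(sho_unit_text):
    """Map each failed (bus, id) pair to 'data', 'spare', or 'unknown'."""
    failed, data, spare = set(), set(), set()
    section = None
    for line in sho_unit_text.splitlines():
        # Section headers decide which bucket the following Disk lines go to.
        if "Failed or Removed" in line:
            section = failed
        elif "Data Disk(s)" in line:
            section = data
        elif "Spare Disk(s)" in line:
            section = spare
        m = DISK.search(line)
        if m and section is not None:
            section.add((int(m.group(1)), int(m.group(2))))
    # Spare membership is checked first: a pulled spare appears only there.
    return {bus_id: ("spare" if bus_id in spare else
                     "data" if bus_id in data else "unknown")
            for bus_id in failed}

spare_pulled = """\
1 Disk(s) Failed or Removed:
Disk 1: (SCSI bus 0, SCSI id 1)
2 Data Disk(s) used by lun 1:
Disk 0: Partition 0; (SCSI bus 0, SCSI id 0)
Disk 102: Partition 0; (SCSI bus 1, SCSI id 2)
Spare physical drives:
1 Spare Disk(s) used by lun 1:
Disk 1: (SCSI bus 0, SCSI id 1)
"""
print(classify_failures(spare_pulled))  # {(0, 1): 'spare'}
```

On the mirror-member failure above, the same function would report the (bus 0, id 0) drive as 'data', since it still appears (with Partition 255) in the data-disk list.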
Simon Clubley
2020-09-21 18:44:04 UTC
Post by Rod Regier
MSA> add unit 1 /disk=(000,102)/spare=(001)/RAID=1
MSA> sho unit
In PDLA mode, Unit 1 is Lun 1.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
Disk 0: Partition 0; (SCSI bus 0, SCSI id 0)
Disk 102: Partition 0; (SCSI bus 1, SCSI id 2)
Disk 1: (SCSI bus 0, SCSI id 1)
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 68.48 [73.53] GB
Those of you who use this utility on a regular basis clearly have
learned to read between the lines when trying to understand what it
is _really_ trying to tell you.

Here are some comments from someone who has never had to use MSA and
sees all the strange inconsistencies in this output. Feel free to use
these comments if you wish to improve the output from this utility.

When I look at the output below, the use of "VOLUME OK" is obviously
some strange use of the word OK that I wasn't previously aware of.
(With apologies to Arthur Dent).

You've just failed a drive and it thinks everything is ok ?

That should say DEGRADED or similar, to go along with the failure
message below it.
Post by Rod Regier
In PDLA mode, Unit 1 is Lun 1.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
Disk 0: (SCSI bus 0, SCSI id 0)
Disk 0: Partition 255; (SCSI bus 0, SCSI id 0)
Disk 102: Partition 0; (SCSI bus 1, SCSI id 2)
Disk 1: (SCSI bus 0, SCSI id 1)
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 68.48 [73.53] GB
I'm assuming the spare disk got promoted to an active RAID 1 member
when the bus 0, id 0 disk failed. Why is the spare disk still listed
as a spare disk instead of being listed as one of the data disks and
why is the spare physical drive entry not now listed as empty or not
otherwise marked as now being in use ?

I assume that after you read the manual, you will then learn at that
point that "Partition 255" really means this unit has failed.

I assume memory limits or similar stopped the designer of this utility
from adding the word "(FAILED)" at the end of that line in the list of
data disks and had to resort to displaying some special number instead.

I can see what it is trying to say, but the output is confusing to
someone not familiar with it. It seems to be a classic case of having
to learn what the output _really_ means instead of just having the
_current_ status correctly and clearly listed in an internally consistent
manner ready for you to read.
Post by Rod Regier
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
Separately fail spare drive by removal and force detection of result
MSA> scan all
MSA> sho unit
In PDLA mode, Unit 1 is Lun 1.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
Disk 1: (SCSI bus 0, SCSI id 1)
Disk 0: Partition 0; (SCSI bus 0, SCSI id 0)
Disk 102: Partition 0; (SCSI bus 1, SCSI id 2)
Disk 1: (SCSI bus 0, SCSI id 1)
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 68.48 [73.53] GB
Likewise, you have just pulled the spare disk. Why is it still listed
as an available spare disk and why is the volume still listed as OK ?

Given the above output, you have to manually match up the bus and id
number of the failed drive with the list of drives to work out what
has failed. That is seriously yucky.

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.
Rod Regier
2020-09-22 16:27:03 UTC
Post by Simon Clubley
Post by Rod Regier
MSA> add unit 1 /disk=(000,102)/spare=(001)/RAID=1
MSA> sho unit
In PDLA mode, Unit 1 is Lun 1.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
Disk 0: Partition 0; (SCSI bus 0, SCSI id 0)
Disk 102: Partition 0; (SCSI bus 1, SCSI id 2)
Disk 1: (SCSI bus 0, SCSI id 1)
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 68.48 [73.53] GB
Those of you who use this utility on a regular basis clearly have
learned to read between the lines when trying to understand what it
is _really_ trying to tell you.
Here are some comments from someone who has never had to use MSA and
sees all the strange inconsistencies in this output. Feel free to use
these comments if you wish to improve the output from this utility.
RR: MSA$UTIL is an HPE / VSI maintained utility.
Post by Simon Clubley
When I look at the output below, the use of "VOLUME OK" is obviously
some strange use of the word OK that I wasn't previously aware of.
(With apologies to Arthur Dent).
RR: There are still two working drives for the RAID mirror logical volume. That makes the logical volume "OK".
Post by Simon Clubley
You've just failed a drive and it thinks everything is ok ?
That should say DEGRADED or similar, to go along with the failure
message below it.
RR: A RAID mirror logical volume with only *one* working drive displays a DEGRADED status.
No example has been supplied in this thread (so far).
Post by Simon Clubley
Post by Rod Regier
In PDLA mode, Unit 1 is Lun 1.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
Disk 0: (SCSI bus 0, SCSI id 0)
Disk 0: Partition 255; (SCSI bus 0, SCSI id 0)
Disk 102: Partition 0; (SCSI bus 1, SCSI id 2)
Disk 1: (SCSI bus 0, SCSI id 1)
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 68.48 [73.53] GB
I'm assuming the spare disk got promoted to an active RAID 1 member
when the bus 0, id 0 disk failed. Why is the spare disk still listed
as a spare disk instead of being listed as one of the data disks and
why is the spare physical drive entry not now listed as empty or not
otherwise marked as now being in use ?
RR: My guess is that the software designers decided not to shuffle around
the roles of the member disks as the status evolved, so that when
replacement disks became available the roles could be repopulated by priority.
That is consistent with my testing as I supplied replacement drives.
Post by Simon Clubley
I assume that after you read the manual, you will then learn at that
point that "Partition 255" really means this unit has failed.
RR: Available documentation is silent on any of the MSA$UTIL displayed details.
The arcane partition detail is not necessary to determine what is going on.
Post by Simon Clubley
I assume memory limits or similar stopped the designer of this utility
from adding the word "(FAILED)" at the end of that line in the list of
data disks and had to resort to displaying some special number instead.
RR: Failed disks are absent from the MSA> SHOW DISKS display.
Failed disks are described as such in the MSA> SHO UNIT display, in a scattered fashion.
Post by Simon Clubley
I can see what it is trying to say, but the output is confusing to
someone not familiar with it. It seems to be a classic case of having
to learn what the output _really_ means instead of just having the
_current_ status correctly and clearly listed in an internally consistent
manner ready for you to read.
RR: minimal docs. Learn by experience seems to be the rule :-(
Post by Simon Clubley
Post by Rod Regier
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
Separately fail spare drive by removal and force detection of result
MSA> scan all
MSA> sho unit
In PDLA mode, Unit 1 is Lun 1.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
Disk 1: (SCSI bus 0, SCSI id 1)
Disk 0: Partition 0; (SCSI bus 0, SCSI id 0)
Disk 102: Partition 0; (SCSI bus 1, SCSI id 2)
Disk 1: (SCSI bus 0, SCSI id 1)
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 68.48 [73.53] GB
Likewise, you have just pulled the spare disk. Why is it still listed
as an available spare disk and why is the volume still listed as OK ?
RR: The spare drive is described as failed, but the failure status is separated from the spare descriptor.
The RAID mirror volume still has two working member drives, so the volume status is still "OK".
Post by Simon Clubley
Given the above output, you have to manually match up the bus and id
number of the failed drive with the list of drives to work out what
has failed. That is seriously yucky.
RR: certainly not simple, but the raw detail is there to connect the dots.
Post by Simon Clubley
Simon.
--
Walking destinations on a map are further away than they appear.
RR: I've had to experiment a lot with a Smartarray controller and member disks
to learn how the MSA$UTIL displays evolve as disks fail and are re-provisioned.

BTW, erasing the front of a disk makes it look "factory new" to the controller
and a candidate to use for replacement of failed drives. This is handy for experimentation
so a single physical drive can be failed (pulled), front zeroed and re-inserted as a "new" drive.

I have a separate server with many different controllers and boot a Linux recovery USB key on it
to test and/or erase disks. Parallel SCSI, SAS, SATA.

Front zero a drive:

dd if=/dev/zero of=/dev/sdb bs=512 count=2048

This writes 1 MiB of zeros (2048 × 512-byte blocks) at the front of the drive.
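As a sanity check before re-inserting the drive, the zeroed region can be read back and compared against /dev/zero. This sketch uses a scratch file as a stand-in for the real device (which would be e.g. /dev/sdb) and assumes GNU coreutils/diffutils:

```shell
# Stand-in for the device under test; substitute the real block device.
disk=/tmp/fake_disk.img
dd if=/dev/urandom of="$disk" bs=512 count=4096 status=none    # fake old data

# Front-zero, same command as above; conv=notrunc leaves the rest intact.
dd if=/dev/zero of="$disk" bs=512 count=2048 conv=notrunc status=none

# Verify the first 1 MiB (2048 * 512 bytes) now reads back as all zeros.
if dd if="$disk" bs=512 count=2048 status=none | cmp -s -n 1048576 - /dev/zero
then
    echo "front 1 MiB is zeroed"
fi
```

On a real device the same cmp pipeline can be pointed at /dev/sdb directly after the dd shown above.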
Simon Clubley
2020-09-22 17:23:27 UTC
Post by Rod Regier
Post by Simon Clubley
Those of you who use this utility on a regular basis clearly have
learned to read between the lines when trying to understand what it
is _really_ trying to tell you.
Here are some comments from someone who has never had to use MSA and
sees all the strange inconsistencies in this output. Feel free to use
these comments if you wish to improve the output from this utility.
RR: MSA$UTIL is an HPE / VSI maintained utility.
Yes, I know. The "you" in this case meant VSI although reading that again
I wasn't as clear as I should have been...
Post by Rod Regier
Post by Simon Clubley
When I look at the output below, the use of "VOLUME OK" is obviously
some strange use of the word OK that I wasn't previously aware of.
(With apologies to Arthur Dent).
RR: There are still two working drives for the RAID mirror logical volume.
That makes the logical volume "OK".
I disagree; the volume is degraded as one level of redundancy has now
disappeared because the spare drive has been brought into active use
due to the failure of one of the drives.

The volume has been configured with two levels of redundancy - a RAID 1
configuration _and_ a spare drive. Anyone who configures that level of
redundancy in normal production use is saying that the data on that
volume is important enough to justify that level of multiple redundancy.

When the configured level of redundancy is not available due to a drive
failure, the volume most certainly is not ok because you have lost one
of the two levels of redundancy.

I would expect to see that volume marked as degraded until the configured
level of redundancy was restored in the same way as would happen if there
was only one available disk in a normal RAID 1 configuration because you
had lost the other disk.

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.
abrsvc
2020-09-22 18:22:37 UTC
Post by Simon Clubley
Post by Rod Regier
Post by Simon Clubley
Those of you who use this utility on a regular basis clearly have
learned to read between the lines when trying to understand what it
is _really_ trying to tell you.
Here are some comments from someone who has never had to use MSA and
sees all the strange inconsistencies in this output. Feel free to use
these comments if you wish to improve the output from this utility.
RR: MSA$UTIL is an HPE / VSI maintained utility.
Yes, I know. The "you" in this case meant VSI although reading that again
I wasn't as clear as I should have been...
Post by Rod Regier
Post by Simon Clubley
When I look at the output below, the use of "VOLUME OK" is obviously
some strange use of the word OK that I wasn't previously aware of.
(With apologies to Arthur Dent).
RR: There are still two working drives for the RAID mirror logical volume.
That makes the logical volume "OK".
I disagree; the volume is degraded as one level of redundancy has now
disappeared because the spare drive has been brought into active use
due to the failure of one of the drives.
The volume has been configured with two levels of redundancy - a RAID 1
configuration _and_ a spare drive. Anyone who configures that level of
redundancy in normal production use is saying that the data on that
volume is important enough to justify that level of multiple redundancy.
When the configured level of redundancy is not available due to a drive
failure, the volume most certainly is not ok because you have lost one
of the two levels of redundancy.
I would expect to see that volume marked as degraded until the configured
level of redundancy was restored in the same way as would happen if there
was only one available disk in a normal RAID 1 configuration because you
had lost the other disk.
Simon.
--
Walking destinations on a map are further away than they appear.
This may be a case of semantics. From the viewpoint of the user, the volume is NOT degraded, as the system is working as expected. From a hardware view, I agree that the "system" is degraded, as all of the bits are no longer there in full.

I take this view from years of dealing with customers who know nothing of how hardware works (nor care). In these cases, suggesting that the system is degraded will create unnecessary concern at the consumer level. Were this state to remain uncorrected for a long period of time, that concern would be justified. If it can be resolved quickly with no interruption in service, then there is no reason to create that concern. Yes, the problem and its resolution should be recorded/reported, but normally there is no reason to cause potential concern at the customer/user level.