Discussion:
COD inconsistency
(too old to reply)
Phillip Helbig (undress to reply)
2020-08-26 22:28:38 UTC
Permalink
%%%%%%%%%%% OPCOM 27-AUG-2020 00:24:57.58 %%%%%%%%%%% WRITE_LOG_FILE
The OpenVMS host software has requested the preservation
of the following OPCOM message data here in the operator log file.
No user action is normally required.

"OPCOM has noticed a COD inconsistency.
Count(7AEEB778) = Alloc(00000000) - (Busy(00038624) + Wait(00000000) + Free(000F40BC) + Dealloc(00000000)."

I've never seen the above before (I know, I should stay in more). It
appears to be mostly harmless, but it can't be completely meaningless
otherwise it wouldn't be there. What gives?
Volker Halle
2020-08-27 06:37:04 UTC
Permalink
Phillip,

[from OPCOM source listing] A COD is an OPCOM internal datastructure: 'Cluster Output Descriptor', most likely used for distributing OPCOM messages inside a cluster. OPCOM has noticed an inconsistency regarding the counts of CODs in the various internal queues.

If you very carefully look at the message, you'll see a minor problem with the value of 'Count' reported:

... Count(7AEEB778) = ...

As you can guess, the value seems to be the address of the variable count instead of it's value ;-) There is a missing '.' in front of the variable name in the BLISS source code [OPCOM]TIMESTAMP - so you've found another day-0 bug in OpenVMS !

Other than that, there is probably nothing you want or can do about this.

Volker.
Phillip Helbig (undress to reply)
2020-08-27 10:30:15 UTC
Permalink
Post by Volker Halle
so you've found another day-0 bug in OpenVMS !
The appropriate people should contact me regarding the details of the
transfer of my award. :-)
Stephen Hoffman
2020-08-27 14:04:52 UTC
Permalink
Post by Volker Halle
If you very carefully look at the message, you'll see a minor problem
All I can see are the mismatched braces.
--
Pure Personal Opinion | HoffmanLabs LLC
Volker Halle
2020-08-28 15:04:56 UTC
Permalink
Phillip,

this 'COD inconsistency' check is supposed to be performed every 5 minutes. Did you get this message in your cluster more than once at regular intervals ?

Busy(00038624) means, that there were 230948. CODs outstanding, which represent OPCOM messages to be sent to another node in the cluster. Were there any hanging OPCOM or CLUSTER_SERVER processes in your cluster at that time ?

The value reported for the local variable COD_COUNT (here: Count(7AEEB778)) is invalid, it's the P1 address of the local variable on the stack, not the correct value of COD_COUNT itself. This message is displayed if COD_COUNT is .NE. 0

Volker.
Phillip Helbig (undress to reply)
2020-08-28 16:08:03 UTC
Permalink
Post by Volker Halle
this 'COD inconsistency' check is supposed to be performed every 5
minutes. Did you get this message in your cluster more than once at
regular intervals ?
Yes, every 5 minutes, but only since 22-DEC-2019, and only on one node,
which is the node running 8.4. Nothing out of the ordinary happened on
22-DEC-2019.
Post by Volker Halle
Busy(00038624) means, that there were 230948. CODs outstanding, which
represent OPCOM messages to be sent to another node in the cluster. Were
there any hanging OPCOM or CLUSTER_SERVER processes in your cluster at
that time ?
I boot and shut down a satellite about once a day. But other than that,
I can't think of anything which would have a bearing on this.
Post by Volker Halle
Count(7AEEB778)) is invalid, it's the P1 address of the local variable
on the stack, not the correct value of COD_COUNT itself. This message is
displayed if COD_COUNT is .NE. 0
OK.
Volker Halle
2020-08-28 17:59:35 UTC
Permalink
Phillip,

with some analysis and SDA commands, one could probably find the CODs and related messages in the virtual address space of the OPCOM process.

Do the reported values change every 5 minutes or are they still the same as on 22-DEC-2019 ? If they don't change, then 'something' really happened (in the 5-minute-interval BEFORE the first occurence of this message) on 22-DEC-2019, which caused this inconsistency.

And if you don't like these messages anymore, try to STOP/ID the OPCOM process and restart it (from the SYSTEM account) using:

$ @SYS$SYSTEM:STARTUP OPCOM

Volker.
Phillip Helbig (undress to reply)
2020-08-28 22:17:46 UTC
Permalink
Post by Volker Halle
with some analysis and SDA commands, one could probably find the CODs and
related messages in the virtual address space of the OPCOM process.
Worth doing?
Post by Volker Halle
Do the reported values change every 5 minutes or are they still the same as
on 22-DEC-2019 ? If they don't change, then 'something' really happened
(in the 5-minute-interval BEFORE the first occurence of this message) on
22-DEC-2019, which caused this inconsistency.
The messages are all identical.
Volker Halle
2020-08-29 09:05:22 UTC
Permalink
Post by Phillip Helbig (undress to reply)
Worth doing?
It's not much work, please try this:

$ ANALYZE/SYS
SDA> SET PROC OPCOM
SDA> exa 00040910;50 ! offset for V8.3, may also work for V8.4
00000000 00000000 00000000 00000000 ................ 00000000.00040910
00000001 00000010 00000080 00000000 ................ 00000000.00040920
00000000 00000000 00000000 00000000 ................ 00000000.00040930
00040948 00040948 00040940 00040940 @***@...H...H... 00000000.00040940
00000000 00000000 00040950 00040950 P...P........... 00000000.00040950


The first 2 longwords are COD_ALLOCATED and COD_FREE_COUNT. In an idle OPCOM process, those values should be the same.
The 3 longwords starting at VA 40924 are constants and could be used with SDA> SEARCH to find this data, should the P0 addresses be different in V8.4.

The quadwords at 40940 and 40948 are the queue headers of the COD_FREE_QUEUE and COD_BUSY_QUEUE. If the queues are empty, those longwords contain the address of the queue header (as shown above).

I'm starting to believe, that the values shown in your OPCOM message, are not real counts, but the results of local memory corruption in OPCOM P0 space.

Volker.
Volker Halle
2020-08-29 09:35:32 UTC
Permalink
Interesting...

I can easliy provoke this message, when modifying the COD_ALLOCATED count (offsets are for V8.2) in OPCOM:

AXPVMS $ run sys$share:delta ! internal process ID for OPCOM is 0001000F
OpenVMS Alpha DELTA Debugger

Exit 00000001

80071674! LDQ R28,#X0008(SP) 0001000F:1;m
00000001
00040770/00000008 9

40770/00000009

exit

%%%%%%%%%%% OPCOM 29-AUG-2020 11:19:36.01 %%%%%%%%%%% WRITE_LOG_FILE
OPCOM has noticed a condition which might be due to an internal error.
It might also be explained by normal events, especially if nodes have just
crashed or rebooted in a VAXcluster. Please bring this message to Compaq's
attention only if you are having problems with operator communications.
"OPCOM has noticed a COD inconsistency.
Count(7AFD37A8) = Alloc(00000000) - (Busy(00038474) + Wait(00000000) + Free(00000000) + Dealloc(00000000)."

So it looks like the whole creation and output of this message is completely bogus. The value shown for Busy is actually the address of the descriptor of the $FAO control-string:

SDA> exa 38474
OPCOM+38474: 010E0074 "t..."
SDA> exa
OPCOM+38478: 00000000.00038400 "........"
SDA> exa 38400;74
65636974 6F6E2073 6168204D 4F43504F OPCOM has notice 00000000.00038400
7369736E 6F636E69 20444F43 20612064 d a COD inconsis 00000000.00038410
58212874 6E756F43 2F212E79 636E6574 tency.!/Count(!X 00000000.00038420
20294C58 2128636F 6C6C4120 3D20294C L) = Alloc(!XL) 00000000.00038430
57202B20 294C5821 28797375 4228202D - (Busy(!XL) + W 00000000.00038440
28656572 46202B20 294C5821 28746961 ait(!XL) + Free( 00000000.00038450
2128636F 6C6C6165 44202B20 294C5821 !XL) + Dealloc(! 00000000.00038460
00000000 00038400 010E0074 2E294C58 XL).t........... 00000000.00038470

Volker.
Phillip Helbig (undress to reply)
2020-08-29 19:36:08 UTC
Permalink
Post by Volker Halle
Post by Phillip Helbig (undress to reply)
Worth doing?
$ ANALYZE/SYS
SDA> SET PROC OPCOM
SDA> exa 00040910;50 ! offset for V8.3, may also work for V8.4
00000000 00000000 00000000 00000000 ................ 00000000.00040910
00000001 00000010 00000080 00000000 ................ 00000000.00040920
00000000 00000000 00000000 00000000 ................ 00000000.00040930
00000000 00000000 00040950 00040950 P...P........... 00000000.00040950
I get

SDA> SET PROC OPCOM
SDA> SET PROC OPCOM
SDA> exa 00040910;50
00000000 00000000 00000067 000000F8 ø...g........... 00000000.00040910
00000001 00000010 00000080 00000000 ................ 00000000.00040920
007E99BB 007E9C7F 00000015 000002C4 Ä.........~.».~. 00000000.00040930
00040948 00040948 000E0760 000C1620 ...`...H...H... 00000000.00040940
00000002 00000000 00040950 00040950 P...P........... 00000000.00040950
Volker Halle
2020-08-30 07:12:15 UTC
Permalink
Post by Volker Halle
I get
SDA> SET PROC OPCOM
SDA> SET PROC OPCOM
SDA> exa 00040910;50
00000000 00000000 00000067 000000F8 ø...g........... 00000000.00040910
00000001 00000010 00000080 00000000 ................ 00000000.00040920
007E99BB 007E9C7F 00000015 000002C4 Ä.........~.».~. 00000000.00040930
00040948 00040948 000E0760 000C1620 ...`...H...H... 00000000.00040940
00000002 00000000 00040950 00040950 P...P........... 00000000.00040950
Phillip,

the offset to the COD data structures is correct:

40910: 000000F8 COD_ALLOCATED
40914: 00000067 COD_FREE_COUNT
40918: 00000000 COD_DEAL_COUNT
4091C: 00000000 COD_WAIT_COUNT
40920: 00000000 COD_BUSY_COUNT

The next 3 longwords are the constants COD_FREE_MAX, COD_BUSY_MAX, COD_BUSY_NODE

40930: 000002C4 COD_ERRORS
40934: 00000015 COD_FLUSHED
40938: 007E9C7F COD_REQUESTS
4093C: 007E99BB COD_QUEUED
40940: 000C1620 COD_FREE_QUEUE non-empty
40944: 000E0760 ...
40948: 00040948 COD_BUSY_QUEUE is empty
4094C: 00040948
40950: 00040950 ACK_WAIT_Q is empty
40954: 00040950

So there is a difference between COD_ALLOCATED and the sum of Free, Deall, Wait and Busy, which is causing the 'COD inconsistency' message to be issued.

Please validate the free queue, to check, whether the no. of elements in the free queue match COD_FREE_COUNT

SDA> VALI QUE 40490

Volker.
Volker Halle
2020-08-30 07:13:24 UTC
Permalink
Correction:

SDA> VALI QUE 40940

Volker.

John H. Reinhardt
2020-08-29 20:25:35 UTC
Permalink
Post by Volker Halle
Post by Phillip Helbig (undress to reply)
Worth doing?
$ ANALYZE/SYS
SDA> SET PROC OPCOM
SDA> exa 00040910;50 ! offset for V8.3, may also work for V8.4
00000000 00000000 00000000 00000000 ................ 00000000.00040910
00000001 00000010 00000080 00000000 ................ 00000000.00040920
00000000 00000000 00000000 00000000 ................ 00000000.00040930
00000000 00000000 00040950 00040950 P...P........... 00000000.00040950
The first 2 longwords are COD_ALLOCATED and COD_FREE_COUNT. In an idle OPCOM process, those values should be the same.
The 3 longwords starting at VA 40924 are constants and could be used with SDA> SEARCH to find this data, should the P0 addresses be different in V8.4.
The quadwords at 40940 and 40948 are the queue headers of the COD_FREE_QUEUE and COD_BUSY_QUEUE. If the queues are empty, those longwords contain the address of the queue header (as shown above).
I'm starting to believe, that the values shown in your OPCOM message, are not real counts, but the results of local memory corruption in OPCOM P0 space.
Volker.
For what it's worth, here is VSI V8.4-2L1 Alpha

$ show system/noproc
OpenVMS V8.4-2L1 on node GORDON 29-AUG-2020 15:24:43.38 Uptime 0 22:40:06
$
$ analy/sys

OpenVMS system analyzer

SDA> SET PROC OPCOM
SDA> exa 00040910;50
00000000 00000000 00000000 00000000 ................ 00000000.00040910
00000001 00000010 00000080 00000000 ................ 00000000.00040920
00000000 00000000 00000000 00000000 ................ 00000000.00040930
00040948 00040948 00040940 00040940 @***@...H...H... 00000000.00040940
00000000 00000000 00040950 00040950 P...P........... 00000000.00040950
--
John H. Reinhardt
Volker Halle
2020-08-30 06:54:12 UTC
Permalink
Post by John H. Reinhardt
For what it's worth, here is VSI V8.4-2L1 Alpha
$ show system/noproc
OpenVMS V8.4-2L1 on node GORDON 29-AUG-2020 15:24:43.38 Uptime 0 22:40:06
$
$ analy/sys
OpenVMS system analyzer
SDA> SET PROC OPCOM
SDA> exa 00040910;50
00000000 00000000 00000000 00000000 ................ 00000000.00040910
00000001 00000010 00000080 00000000 ................ 00000000.00040920
00000000 00000000 00000000 00000000 ................ 00000000.00040930
00000000 00000000 00040950 00040950 P...P........... 00000000.00040950
--
John H. Reinhardt
John,

the offset to the COD data structures is correct (in V8.4-2L1), but your system is not a member of a cluster. A COD is a 'Cluster Output Descriptor' and only used in a cluster environment.

Volker.
Loading...