Discussion:
Batch Queue Jobs Stuck In Starting Status
a***@floatingbear.ca
2006-11-23 15:32:14 UTC
We are running an Alpha with OpenVMS V7.1 and have for many years. We
occasionally have problems with job synchronization where the job we are
waiting for ends and the synchronize job just keeps waiting. The process
that is running is QUEMAN.
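For reference, the synchronization is the usual DCL pattern, roughly like
this (queue and job names here are made-up examples):

$ SUBMIT /QUEUE=SYS$BATCH /NAME=NIGHTLY_LOAD LOAD.COM
$ SYNCHRONIZE NIGHTLY_LOAD /QUEUE=SYS$BATCH   ! wait for the job to finish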

However, last night, after adding "just a couple" of more jobs to our
overnight processing, all of the queues stopped working. Quite a
number of jobs completed normally and about a dozen jobs were waiting
for a synchronize on a job that had already finished. HOWEVER, all of
the rest of the jobs that should have been executing were sitting in a
"Starting" status. Trying to delete existing jobs had them go to an
"Aborting" status and hang. Stopping the queues also did not complete.

We ended up re-booting the system and rebuilding the queue manager
files and then manually re-submitting the jobs, which are now chugging
along. About two years ago, we also started a practice of re-booting
the system once a month and rebuilding the queue manager files about
once a quarter. We last did that about a week ago.

Is anyone familiar with what might have caused our problems last night
with the jobs sitting in "Starting"? Is there something that we should
be doing that could resolve this problem? My task for today is to try
to reduce the number of jobs that are in the queue at any one time.

Thanks

Andrew Butchart
***@floatingbear.ca
Christoph Gartmann
2006-11-23 16:19:25 UTC
Post by a***@floatingbear.ca
However, last night, after adding "just a couple" of more jobs to our
overnight processing, all of the queues stopped working. Quite a
number of jobs completed normally and about a dozen jobs were waiting
for a synchronize on a job that had already finished. HOWEVER, all of
the rest of the jobs that should have been executing were sitting in a
"Starting" status. Trying to delete existing jobs had them go to an
"Aborting" status and hang. Stopping the queues also did not complete.
Just a few thoughts:
- are disk quotas enabled somewhere (either on the drive that holds the jobs
or on the drive that keeps the logs)?
- did you ever do an ANALYZE/DISK/REPAIR on the disk(s) involved?
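In case it helps, the checks would look something like this (the device
name is just a placeholder):

$ SHOW QUOTA /DISK=DKA0:                  ! says QFNOTACT if quotas are off
$ ANALYZE /DISK_STRUCTURE DKA0:           ! report-only pass first
$ ANALYZE /DISK_STRUCTURE /REPAIR DKA0:   ! only after reviewing the report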

Regards,
Christoph Gartmann
--
Max-Planck-Institut fuer Immunbiologie    Phone : +49-761-5108-464  Fax: -452
Postfach 1169                             Internet: ***@immunbio dot mpg dot de
D-79011 Freiburg, Germany
http://www.immunbio.mpg.de/home/menue.html
Andrew Butchart
2006-11-23 17:13:52 UTC
Christoph Gartmann wrote:
<snip>
Post by Christoph Gartmann
- are disk quotas enabled somewhere (either on the drive that holds the jobs
or on the drive that keeps the logs)?
- did you ever do an ANALYZE/DISK/REPAIR on the disk(s) involved?
Regards,
Christoph Gartmann
Christoph:

Thank you for your prompt reply. I believe the disk you are referring
to would be the one that contains the queue manager files -
sys$common:[sysexe]QMAN$MASTER.DAT - which on our system is the dkc100
volume.

show quota /disk=dkc100:
%SYSTEM-F-QFNOTACT, disk quotas not enabled on this volume
so no - we don't have disk quotas enabled - I didn't think so.

We haven't done an analyze on the drive. I just did now - without the
/repair parameter since I'm paranoid.
It turned up a lot of messages about files marked for delete, plus the
following messages and a few similar ones:
%ANALDISK-I-BADHIGHWATER, file (4098,255,1) PAGEFILE.SYS;1
        inconsistent highwater mark and EFBLK
%ANALDISK-W-FUTCREDAT, file (11616,4702,1) [OVERNITE]00001A3B.TMP;1
        creation date is in the future
%ANALDISK-W-ALLOCSET, blocks incorrectly marked free
        LBN 2814210 to 2814244, RVN 1
%ANALDISK-W-ALLOCCLR, blocks incorrectly marked allocated
        LBN 2814315 to 2814419, RVN 1
%ANALDISK-W-ALLOCSET, blocks incorrectly marked free
        LBN 3356290 to 3356324, RVN 1

None of the messages were marked with ANALDISK-E-, though.

There are other similar messages on the other disk devices but again,
nothing marked with an "-E-" and nothing referencing any of the QUEMAN
files.

Andrew B
Peter Weaver
2006-11-23 17:48:31 UTC
----- Original Message -----
From: <***@floatingbear.ca>
To: <Info-***@Mvb.Saic.Com>
Sent: Thursday, November 23, 2006 10:32 AM
Subject: Batch Queue Jobs Stuck In Starting Status
Post by a***@floatingbear.ca
...
We ended up re-booting the system and rebuilding the queue manager
files and then manually re-submitting the jobs, which are now chugging
along. About two years ago, we also started a practice of re-booting
the system once a month and rebuilding the queue manager files about
once a quarter. We last did that about a week ago.
...
That sounds like a very unusual thing to have to do. I have worked on many
large systems with large numbers of queues and I have never heard of anyone
needing to reboot once a month or to rebuild the queue database. Is it
possible that you have some program(s) out there trying to do something with
the queue files directly? Or maybe some programs that are trying to use
$SNDJBC incorrectly?

Peter Weaver
www.weaverconsulting.ca
CHARON-VAX CHARON-AXP DataStream Reflection PreciseMail
Andrew Butchart
2006-11-23 19:14:46 UTC
Post by Peter Weaver
----- Original Message -----
<snip>
Post by Peter Weaver
Is it
possible that you have some program(s) out there trying to do something with
the queue files directly? Or maybe some programs that are trying to use
$SNDJBC incorrectly?
Peter Weaver
www.weaverconsulting.ca
CHARON-VAX CHARON-AXP DataStream Reflection PreciseMail
There doesn't appear to be anything. I've only been working with this
code for about 3 years, and I don't know of anything that could possibly
have a reason to manipulate the queues directly.

Andrew B
d***@montagar.com
2006-11-23 19:31:02 UTC
Post by a***@floatingbear.ca
We are running an Alpha with OpenVMS V7.1 and have for many years. We
occasionally have problems with job synchronization where the job we are
waiting for ends and the synchronize job just keeps waiting. The process
that is running is QUEMAN.
There could be some timing issues in these batch jobs. For example: batch
job FOO starts and re-submits itself for tomorrow. Batch job BAR wants to
SYNC on FOO, but executes the SYNC after FOO resubmits itself. The
SYNC is then waiting on tomorrow's run, not on today's run.
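In DCL terms the race looks something like this (file, queue and job
names invented for illustration):

$! Near the top of FOO.COM -- it resubmits itself before the real work:
$ SUBMIT /QUEUE=SYS$BATCH /NAME=FOO /AFTER="TOMORROW+01:00" FOO.COM
$! In BAR.COM -- if this runs after the resubmit above, the name FOO
$! can match tomorrow's pending entry instead of today's executing one:
$ SYNCHRONIZE FOO /QUEUE=SYS$BATCH

One way around it (assuming the entry number can be handed to BAR) is to
capture the $ENTRY symbol that SUBMIT defines and wait with
SYNCHRONIZE /ENTRY=n instead of waiting by name.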

The "stuck in Starting" part is a little more tricky. My guess is there is
a resource limitation preventing the queue manager from starting
the next job. Something like PROCESSLIMIT or something else in the UAF.
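A quick way to look (the username is a placeholder):

$ MC AUTHORIZE SHOW BATCH_USER    ! check Maxjobs, Prclm and friends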
JF Mezei
2006-11-23 20:17:56 UTC
Post by d***@montagar.com
The stuck in "Starting" is a little more tricky. My guess is there is
a resource limitation causing the queue manager to not be able to start
the next job. Something like PROCESSLIMIT or soemthing in the UAF.
I seem to recall, in a distant memory, having the same problem once (while
merging two systems' batch queues). As I recall, it was queue-manager related.

Make sure you don't have a partitioned queue manager (where node 1 has its
own database and node 2 has its own identical database and both have their
own queue manager).

Try a SHOW QUEUE/MANAGER/FULL on every node in your cluster and make sure
they all agree.
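If SYSMAN is handy, one pass covers the whole cluster:

$ MC SYSMAN
SYSMAN> SET ENVIRONMENT /CLUSTER
SYSMAN> DO SHOW QUEUE /MANAGER /FULL
SYSMAN> EXIT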
Andrew Butchart
2006-11-23 21:25:17 UTC
JF Mezei wrote:
<snip>
Post by JF Mezei
I seem to recall, in a distant memory, having the same problem once (while
merging two systems' batch queues). As I recall, it was queue-manager related.
Make sure you don't have a partitioned queue manager (where node 1 has its
own database and node 2 has its own identical database and both have their
own queue manager).
Try a SHOW QUEUE/MANAGER/FULL on every node in your cluster and make sure
they all agree.
We have two nodes in the cluster (production and test) but they don't
share queues or pass jobs back and forth at all.

Andrew B
Peter Weaver
2006-11-23 22:28:34 UTC
Post by Andrew Butchart
...
We have two nodes in the cluster (production and test) but they don't
share queues or pass jobs back and forth at all.
...
That may be part of your problem. According to section 11.4.2 of the V8.3
Guidelines for OpenVMS Cluster Configurations "Every OpenVMS Cluster has
only one QMAN$MASTER.DAT file. Multiple queue managers are defined through
multiple *.QMAN$QUEUES and *.QMAN$JOURNAL files." I would not want to try
using different QMAN$MASTER.DAT files in a cluster.

Peter Weaver
www.weaverconsulting.ca
CHARON-VAX CHARON-AXP DataStream Reflection PreciseMail
e***@yahoo.co.uk
2006-11-24 13:18:54 UTC
I think Peter has it right. Unless the job controller or queue manager
(or the system disk, for that matter) has a problem, I'd expect the next
step to be to sort out the queue managers so that there's one
QMAN$MASTER.DAT and a different queue manager (which is a legal config)
on each node.
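Roughly like this, if memory serves (manager and node names invented;
check START/QUEUE/MANAGER in the DCL dictionary before trusting me):

$! Shared master file, plus an additional named manager for the test node
$ START /QUEUE /MANAGER /ADD /NAME_OF_MANAGER=TEST$MANAGER -
        /ON=(TSTNOD::) SYS$COMMON:[SYSEXE]

Queues would then be tied to it with INITIALIZE/QUEUE
.../NAME_OF_MANAGER=TEST$MANAGER.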

That said, I've seen jobs sitting in a STARTING state before now and
can't quite remember what the problem was. It may be worth checking
system params such as MAXPROCESSCNT to make sure you're not nearer to
the limit than you thought you were.
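Something along these lines would show how close you are (a sketch):

$ WRITE SYS$OUTPUT "MAXPROCESSCNT = ", F$GETSYI("MAXPROCESSCNT")
$ SHOW SYSTEM    ! eyeball the number of processes against that ceiling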

Steve
Post by Peter Weaver
Post by Andrew Butchart
...
We have two nodes in the cluster (production and test) but they don't
share queues or pass jobs back and forth at all.
...
That may be part of your problem. According to section 11.4.2 of the V8.3
Guidelines for OpenVMS Cluster Configurations "Every OpenVMS Cluster has
only one QMAN$MASTER.DAT file. Multiple queue managers are defined through
multiple *.QMAN$QUEUES and *.QMAN$JOURNAL files." I would not want to try
using different QMAN$MASTER.DAT files in a cluster.
Peter Weaver
www.weaverconsulting.ca
CHARON-VAX CHARON-AXP DataStream Reflection PreciseMail
JF Mezei
2006-11-24 21:36:38 UTC
Post by e***@yahoo.co.uk
That said, I've seen jobs sitting in a STARTING state before now and
can't quite remember what the problem was.
Consider this:

Inability to create the log file, or to access the submitted file, results
in an immediate job cancellation and error message, not a job hung in the
"Starting" state. And once the process has been created, the job is no
longer in the "Starting" state.

So the problem is likely occurring before the batch process has been
created, hence in the queue manager. I would think that having two
separate queue managers in the same cluster could be an issue (probably
with locks). Another possible issue would be a blocked SYSUAF (opened
without read/write sharing by another application), which would prevent
the queue manager from looking up the username and getting its details.
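If you suspect the SYSUAF, something like this shows who has it open
(assuming it lives on the system disk):

$ SHOW DEVICE /FILES SYS$SYSDEVICE:    ! scan the listing for SYSUAF.DAT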
Andrew Butchart
2006-11-28 15:53:58 UTC
Well - I've finally managed to persuade our operations people to do an
ana/disk/repair (took several meetings). Although there weren't any
bad "errors" found, a lot of mis-reported files did relate to old jobs.

Last night everything ran fine - my fingers are crossed for future
runs.

They are also looking into Peter's suggestion (sent to me off-list) to
check the patch level of the queue manager, but since we haven't
applied any patches to the OS since 1996, they're taking it slowly and
carefully.

Thanks everyone for the help.

Andrew B
Dave Gullen
2006-11-28 16:33:37 UTC
Might be worth checking the size of the QMAN Journal file,
SYS$SYSTEM:QMAN$JOURNAL.DAT.
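For example:

$ DIRECTORY /SIZE=ALL SYS$SYSTEM:QMAN$JOURNAL.DAT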

If it's very big, try this (undocumented in 7.1) command to shrink it.

$ MC JBC$COMMAND DIAG 7

Dave
Post by Andrew Butchart
Well - I've finally managed to persuade our operations people to do an
ana/disk/repair (took several meetings). Although there weren't any
bad "errors" found, a lot of mis-reported files did relate to old jobs.
Last night everything ran fine - my fingers are crossed for future
runs.
They are also looking into Peter's suggestion (sent to me off-list) to
check the patch level of the queue manager, but since we haven't
applied any patches to the OS since 1996, they're taking it slowly and
carefully.
Thanks everyone for the help.
Andrew B