Richard Jordan
2024-10-18 18:26:53 UTC
RX2800 i4 server, 64GB RAM, 4 processors, P410i controller with ten 2TB
disks in RAID 6, broken down into volumes.
We periodically see one overnight batch job take much longer than normal
to run (sometimes steadily once a week, sometimes more frequently).
Normal runtime is about 30-35 minutes; a long run takes 4.5 - 6.5 hours.
Several images called by that job all run much slower than normal. At
the end, the overall CPU and I/O counts are very close between a normal
and a long job.
The data files are very large indexed files. Records are read and
updated but not added in this job; output is just tabulated reports.
We've run MONITOR (all classes and DISK) and also built polling snapshot
jobs that check for locked/busy files and other active batch jobs, and
we've gone through the system analyzer looking for any other processes
accessing the busy files at the same time as the problem batch. Two data
files show long busy periods, but we do not see any other process with
channels to those files at that time except for backup (see next).
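For context, the polling snapshot job is roughly a loop like the one
below (device name and interval are placeholders, not our real values);
the batch log then gives a timeline of who had the files open around the
time the problem job slows down:

  $ LOOP:
  $   SHOW TIME
  $   SHOW DEVICE/FILES/NOSYSTEM DKA100:   ! who has files open on the data volume
  $   SHOW SYSTEM/BATCH                    ! any other batch jobs active
  $   WAIT 00:05:00                        ! snapshot every 5 minutes
  $   GOTO LOOP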
The backups start at the same time, but do not get to the data disks
until well after the problem job normally completes. That does cause
concurrent access to the problem files, but it occurs only when the job
has already run long, so it is not the cause. Overall backup time is
about the same regardless of how long the problem batch takes.
MONITOR during a long run shows average and peak I/O rates to the disks
holding the busy files at about half of what they are for normal runs.
We can see that in the process snapshots too; the direct I/O count on a
slow run increases much more slowly than on a normal run, but both
normal and long runs end up with close to the same CPU time and total
I/Os.
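Since the slow runs do the same total work at a much lower I/O rate, it
looks like the per-I/O wait time is what grows, so the next data point I
want is disk queue length during a long run versus a normal one. The
recording is something like this (interval, duration, and file names are
placeholders):

  $ MONITOR DISK/ITEM=ALL/INTERVAL=30/NODISPLAY -
        /RECORD=SYS$MANAGER:DISKMON.DAT/ENDING="+8:00:00"
  $! later, play the recording back and summarize queue length:
  $ MONITOR DISK/ITEM=QUEUE_LENGTH/INPUT=SYS$MANAGER:DISKMON.DAT/SUMMARY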
Other jobs in MONITOR are somewhat slowed down, but nowhere near as much
(and they do much less disk access).
Before anyone asks: the indexed files could probably use a
cleanup/rebuild, but if that were the cause, would we see periodic
performance issues? I would expect them to be constant.
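For reference, checking how degraded the files actually are is cheap and
read-only with ANALYZE/RMS_FILE (the file spec below is a placeholder):

  $ ANALYZE/RMS_FILE/STATISTICS DKA100:[APPDATA]BIGFILE.IDX
  $! the report shows index depth, bucket fill, and record/key statistics,
  $! which is a rough gauge of whether a rebuild is overdue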
There is a backup server available, so I'm going to restore backups of
the two problem files to it and do rebuilds to see how long they take;
that will determine how/when we can do it on the production server.
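The rebuild test will probably follow the usual FDL optimize-and-convert
sequence, roughly as below (file names are placeholders;
CONVERT/STATISTICS reports the elapsed time we're after):

  $ ANALYZE/RMS_FILE/FDL/OUTPUT=BIGFILE.FDL BIGFILE.IDX
  $ EDIT/FDL/ANALYSIS=BIGFILE.FDL/NOINTERACTIVE/OUTPUT=BIGFILE_OPT.FDL BIGFILE.FDL
  $ CONVERT/FDL=BIGFILE_OPT.FDL/STATISTICS BIGFILE.IDX BIGFILE_NEW.IDX

That should tell us how long a full rebuild takes and how big a window
we'd need on production.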
So something is apparently causing the job to be I/O constrained, but so
far we can't find it. The concurrent processes are the same, and other
jobs don't appear to be slowed down much (though they may be much less
I/O sensitive or may use data on other disks; I've thrown that question
to the devs).
Is there anything in the background, below VMS, that could cause this?
The controller doing drive checks or other maintenance activities?
Thanks for any ideas.