Discussion:
Reverse telnet printers (TNAxxxx:) owned by non-existent process
Scott Snadow
2021-06-01 16:22:47 UTC
Hello everyone, after the odd DCL "READ/TIMEOUT=n" issue that I posted a couple of weeks ago, I have another problem to ask about. Unfortunately, unlike the last issue, I don't have any hypothesis about what's causing the problem nor how to prevent it.

So: I'm running OpenVMS Alpha V7.3-2; HP TCP/IP V5.4 with ECO 6. I have a few hundred "reverse Telnet" printers defined on the server. These have TNAxxxx: devices created, but they do not have VMS-level queues. They're used by code that simply opens the TNA device, writes to it, and - at some point - closes it. None of them are set "spooled." The actual printers are varied: Mostly HP LaserJets and OKI laser printers, plus an assortment of other brands and models. Most use port 9100. To make things interesting, some are serial printers connected to Lantronix terminal servers, and those use other port numbers (10001 and above.) The network infrastructure is Cisco routers and switches, shared with plenty of other servers running Windows or Linux or AIX. This has been in place for years, relatively trouble free.

But intermittently, a printer will have its TNA device decide that it is owned by a non-existent process: SHOW DEVICE/FULL will show a non-zero PID, but a null process name. SHOW PROCESS/ID=xxxxxxxx on the PID will give a non-existent process message. If we wait, rarely the problem will resolve itself, but most of the time (perhaps 95+%) it does not. When it does not clear itself, we know that we'll eventually get a phone call from users that are complaining that their reports aren't printing. This problem seems to occur on average perhaps 3 or 4 times a day. It occurs on the busier printers more often than the infrequently used printers.

In no particular order, we've come up with three ways that we can _usually_ "fix" this:
1) Delete and re-create the TNA device (TELNET /DELETE_SESSION, TELNET /CREATE_SESSION)
2) From a privileged account, copy any file to the TNA device (such as COPY NL: TNAxxxx:)
3) Reboot (Obviously this is not at all desirable, but it's guaranteed to work!)

As a workaround, I've set up a DCL batch job that checks all TNA printer devices every five minutes with F$GETDVI. If it finds one with a non-zero owner PID, and F$GETJPI on that PID returns a non-existent process status, I copy NL: to the TNA: device and check the owner PID again. So far this works reliably, with the re-checked PID coming up as zero.
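
A minimal sketch of that kind of check follows, for illustration only; it is not the actual job. It assumes the printers sit on units TNA1: through a hypothetical upper bound, that the job runs from a suitably privileged account, and that SS$_NONEXPR has the value %X8E8. The procedure name, the symbol names and the unit range are all made up for the example.

```dcl
$! CHECK_TNA.COM -- illustrative sketch, not the production procedure.
$! Assumptions: units TNA1: .. TNA'max_unit':, privileged account,
$! SS$_NONEXPR = %X000008E8 (nonexistent process).
$ SET NOON                              ! do not abort on errors
$ nonexpr  = %X000008E8
$ max_unit = 9999                       ! hypothetical upper bound
$ unit = 0
$NEXT:
$ unit = unit + 1
$ IF unit .GT. max_unit THEN EXIT
$ dev = "TNA" + F$STRING(unit) + ":"
$ IF .NOT. F$GETDVI(dev, "EXISTS") THEN GOTO NEXT
$ pid = F$GETDVI(dev, "PID")            ! owner PID as a hex string
$ IF pid .EQS. "" .OR. pid .EQS. "00000000" THEN GOTO NEXT   ! unowned
$ name = F$GETJPI(pid, "PRCNAM")        ! signals an error if owner is gone
$ status = $STATUS
$ IF status .NE. nonexpr THEN GOTO NEXT ! owner exists, or some other error
$ WRITE SYS$OUTPUT F$TIME(), " ", dev, " owned by non-existent PID ", pid
$ COPY NL: 'dev'                        ! the fix described in 2) above
$ WRITE SYS$OUTPUT "  owner PID after COPY: ", F$GETDVI(dev, "PID")
$ GOTO NEXT
```

Only the non-existent process status triggers the COPY, so interactive Telnet sessions that also show up as TNA devices, and owners that cannot be examined for lack of privilege, are left alone.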

But that's a hack. I'd much rather prevent the problem from occurring in the first place. Any ideas on what causes this and/or how to stop it?

Thanks,
Scott
Grant Taylor
2021-06-01 16:45:53 UTC
Post by Scott Snadow
But that's a hack. I'd much rather prevent the problem from occurring
in the first place. Any ideas on what causes this and/or how to
stop it?
I'm just lobbing this out there.

Is there any chance that the problem is associated with timing and how
quickly print jobs are sent to the TNAs? As in if jobs are sent close
enough together / back to back that the connection(s) / print jobs
(maybe on the printer's end) are causing state to be wrong? E.g. two
print jobs close enough together / back to back that the second one ends
up re-using the first one's established connection?

Where I'm going with this is if the second one tries to clear the
connection it thinks it established, it probably can't because the data
is wrong. Similarly the first print job can't clear its connection
because it's in use by the second print job.

I don't know. I'm just thinking out loud.

If this, or something timing related, is close to the problem, I'd think
that you could probably reproduce this on demand once you figure out the
problem criteria.
--
Grant. . . .
unix || die
Simon Clubley
2021-06-01 17:23:58 UTC
Post by Scott Snadow
1) Delete and re-create the TNA device (TELNET /DELETE_SESSION, TELNET /CREATE_SESSION)
2) From a privileged account, copy any file to the TNA device (such as COPY NL: TNAxxxx:)
3) Reboot (Obviously this is not at all desirable, but it's guaranteed to work!)
Does power cycling the printer clear the problem ?
Post by Scott Snadow
As a workaround, I've set up a DCL batch job that checks all TNA printer devices every five minutes with F$GETDVI, and if it find one with a non-zero owner PID, and F$GETJPI on that PID returns a non-existent process status, I copy NL: to the TNA: device and check the owner PID again. So far this works reliably, with the re-checked PID coming up as zero.
But that's a hack. I'd much rather prevent the problem from occurring in the first place. Any ideas on what causes this and/or how to stop it?
Does the underlying socket still exist on the VMS system and if so,
what state is the socket in ?

My guess would be that the socket close sequence has gone wrong and
that the underlying socket is stuck in some closing state.

Do you have TCP-level keepalives enabled on the system in question ?

If not, have you tried enabling them ?

Is this only on serial printers or only on network printers or is it
a mixture of the two ?

It's been a while since I used reverse Telnet printers, but is there
some timeout setting you can apply to the TNA device itself when you
create the device ?

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.
Stephen Hoffman
2021-06-01 17:30:16 UTC
Post by Scott Snadow
But that's a hack. I'd much rather prevent the problem from occurring
in the first place. Any ideas on what causes this and/or how to stop
it?
You're still on OpenVMS Alpha V7.3-2, so arguably it's all a hack.

Try to spool the devices if this is straight output and not a mix of
output and input, and see if that reduces the window when these issues
can arise.

Don't get SHARE privilege involved, as SHARE is how these problems tend
to arise. Use of SHARE and the last-channel deassign logic tend to
interact poorly. There have been patches in this area, too.

There are kernel-mode hacks to clear device ownership, some of which
have undoubtedly been posted here, if that COPY NL: hack should fail to
work.
--
Pure Personal Opinion | HoffmanLabs LLC
Phil Howell
2021-06-02 03:48:13 UTC
Didn't Jan-Erik have this problem a couple of years ago?
Jan-Erik Söderholm
2021-06-02 08:15:14 UTC
Post by Phil Howell
Didn't Jan-Erik have this problem a couple of years ago?
He he... :-)
I was just going to write a note about that yesterday, but...
Funny someone remembers that.

Well, my view on this is that a process did an I/O operation (usually
a write, but maybe it can be a read also) against a TNA device and
then for some reason the process died while the I/O was still waiting.

Then you will get a TNA device with an "owner" that doesn't exist.
This prevents the "telnet /delete" on that port. Our usual work-around
is to edit the process startup script and use another (free) TNA device.

I have also been looking at a script that uses a range of TNA devices
and just uses a new one each time a process is restarted. Not finished.
Jilly
2021-06-02 16:25:45 UTC
Snipped most of the post
But intermittently, a printer will have its TNA device decide that it is owned by a non-existent process: SHOW DEVICE/FULL will show a non-zero PID, but a null process name. SHOW PROCESS/ID=xxxxxxxx on the PID will give a non-existent process message.
Snipped the rest
Thanks,
Scott
Remember that the 'non-existent process message' may not be technically
correct. If the SHOW command cannot queue an AST to the process you'll still
get this message even though the process does exist. Next time use
SDA> SHOW PROCESS/INDEX={pid} to look for the process.