[comp.sys.apollo] mysterious hint file on HP/Domain/OS

fleureck@imec.be (Marc Fleureck) (04/04/91)

On Domain/OS SR10.2 some machines seem to have a problem with the file 
/sys/node_data/hint_file.  For example, if I log out ( under DM ) on the 
machine itself, then go home to drink a cup of coffee and come back, 
I MIGHT be able to login again because the DM does not yet display its 
login pad.  In short, logout is terribly SLOW.
Solution : remove the hint file and reboot the machine.

Consultation of the manual tells me very little about the purpose of this
file in connection with this problem.  Could someone explain to me the purpose
and mechanism of this file?  Does it affect performance in any other way?


Greetings,


		Marc Fleureck
		System Administration Unix
		IMEC vzw. 
		Kapeldreef 75, 3001 Heverlee
		BELGIUM
		Mail : fleureck@imec.be

rees@dabo.citi.umich.edu (Jim Rees) (04/05/91)

In article <885@imec.UUCP>, fleureck@imec.be (Marc Fleureck) writes:

  On Domain/OS SR10.2 some machines seem to have a problem with the file 
  /sys/node_data/hint_file.  For example, if I log out ( under DM ) on the 
  machine itself, then go home to drink a cup of coffee and come back, 
  I MIGHT be able to login again because the DM does not yet display its 
  login pad.  In short, logout is terribly SLOW.
  Solution : remove the hint file and reboot the machine.

When you want to open a file, Domain/OS first translates the name into a uid
(like a dev/inode).  Then it has to locate the node holding that object.  In
the old, old days, that was easy, because the uid contains the node id of
the node on which it was generated.

Then removable (floppy) disks came along, and it got harder.  Now you could
create a disk on one node, then mount it on another.  To find the objects on
that disk you have to go to a different node from the one identified in the
object uid.  Fortunately, the only way to get an object uid (usually) is by
resolving a name, and when you resolve a name you start at a known place and
work your way down, accumulating location information as you go.  Domain/OS
caches this location information in the hint_file so that when you go to
open (map) the object, it can be found.
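
A minimal sketch of the lookup described above, not the actual Domain/OS
code. The names Uid and HintCache, the field layout, and the node ids are
all hypothetical. It only illustrates the idea that a cached hint overrides
the node id embedded in the uid:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uid:
    """Hypothetical stand-in for an object uid: creating node id plus a serial."""
    node_id: int
    serial: int


class HintCache:
    """Toy location-hint table: uid -> node currently holding the object."""

    def __init__(self):
        self._hints = {}

    def record(self, uid, holder_node):
        # Called during name resolution, when we learn where the object lives.
        self._hints[uid] = holder_node

    def locate(self, uid):
        # Prefer a cached hint; otherwise fall back to the node id embedded
        # in the uid (the node on which the object was created).
        return self._hints.get(uid, uid.node_id)


cache = HintCache()
fixed = Uid(node_id=0x1A2B, serial=1)    # still on the node that created it
floppy = Uid(node_id=0x1A2B, serial=2)   # created on 0x1A2B, disk moved since
cache.record(floppy, 0x3C4D)             # hint learned while resolving a name

print(hex(cache.locate(fixed)))          # no hint: embedded node id is used
print(hex(cache.locate(floppy)))         # hint wins over the embedded node id
```

If the hint for the floppy object is never refreshed after the disk moves
again, locate() keeps returning the old node, which is the stale-hint case.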

Then internets came along, and it got way harder.  Now the hint file had to
be changed to hold network numbers as well as node ids.  The bad news is
that the hint file doesn't get flushed when it should, and stale information
is sometimes used after a floppy changes nodes or a node changes networks.
In general, you need to remove all hint files and reboot all nodes whenever
a node moves from one network to another, or net numbers change.

Things become slow because when Domain/OS has to try a remote network to
locate an object, it doesn't get a nack as it does on the local network.
Instead it has to time out.  If it has to time out several remote networks,
it can take a long time.
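
To make the arithmetic concrete, here is a rough sketch of how the delay
adds up. The function name and the nack and timeout values are assumptions
chosen for illustration, not measured Domain/OS figures:

```python
def lookup_delay(candidates, local_net, holder, nack_s=0.01, timeout_s=5.0):
    """Return the seconds wasted on wrong guesses before the object is found.

    candidates: (net_id, node_id) guesses, tried in order.
    holder:     the (net_id, node_id) that really holds the object.
    nack_s and timeout_s are assumed values, for illustration only.
    """
    elapsed = 0.0
    for guess in candidates:
        if guess == holder:
            return elapsed              # found it; stop trying
        net, _node = guess
        # A wrong guess on the local network is refused quickly (nack);
        # a wrong guess on a remote network can only be detected by timeout.
        elapsed += nack_s if net == local_net else timeout_s
    return elapsed


# Three stale guesses on remote networks, then the right one on the local net.
stale_remote = lookup_delay([(2, 1), (3, 1), (4, 1), (1, 1)],
                            local_net=1, holder=(1, 1))
# One wrong guess on the local network, then the right one.
wrong_local = lookup_delay([(1, 9), (1, 1)], local_net=1, holder=(1, 1))
print(stale_remote, wrong_local)
```

With these assumed values, three stale remote guesses cost 15 seconds,
while a wrong local guess costs a hundredth of a second.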

The hint file code is old and crufty and probably could use some work.

If you find that you have to remove the hint file at times when you haven't
changed your network topology, then something else may be wrong.  Check to
see that your nodes are cataloged correctly, with the right net id, in the
network root name server.

thompson@PAN.SSEC.HONEYWELL.COM (John Thompson) (04/05/91)

<<forwarded message>>
> On Domain/OS SR10.2 some machines seem to have a problem with the file 
> /sys/node_data/hint_file.  For example, if I log out ( under DM ) on the 
> machine itself, then go home to drink a cup of coffee and come back, 
> I MIGHT be able to login again because the DM does not yet display its 
> login pad.  In short, logout is terribly SLOW.
> Solution : remove the hint file and reboot the machine.

If I remember correctly, the hint_file contains information concerning the
network number(s) that the node is on (and that other nodes are?).  Have
these nodes been moved from one net to a new one recently?  Have they been 
INVOLed recently?  If so, you might have conflicting network IDs.  You
can discover this by running '/etc/rtsvc' on each node.  Note the network
ID that is listed.  Unless you're running internet domain, they should all
be the same.  (It's possible to run 2 separate domain nets on the same 
line, but it's advised against.)  
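
If there are many nodes, the comparison can be done mechanically once the
IDs have been noted down. The sketch below assumes a single physical net
with no internet routing, and assumes you typed in the ID each node
reported by hand. It does not run or parse rtsvc, and the node names and
IDs are made up:

```python
from collections import Counter


def find_netid_conflicts(reported):
    """Return (majority_id, {node: id}) for nodes that disagree.

    reported: dict of node name -> network ID string, noted by hand.
    """
    counts = Counter(reported.values())
    majority_id, _count = counts.most_common(1)[0]
    odd = {node: nid for node, nid in reported.items() if nid != majority_id}
    return majority_id, odd


# Hypothetical node names and network IDs.
reported = {
    "node_a": "1234ABCD",
    "node_b": "1234ABCD",
    "node_c": "0",
}
majority, conflicts = find_netid_conflicts(reported)
print("majority network ID:", majority)
print("nodes to look at:", conflicts)
```

The majority ID is only a starting point; the ID you actually want may be
the minority one.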

Assuming this is the problem, decide on a separate network ID for each of
the physical nets (ethernet, Apollo Token Ring, IBM Token Ring) that you
have, and then go around to each node that is conflicting with the ID.
Run "/etc/rtsvc -dev <DEVICE> -net <NETID> [-route | -noroute]"
substituting the device (e.g. RING, ETH802.3_AT -- it'll be the same as what
the rtsvc command displayed for you) and NETID for that device.  If you're
running internet domain, the gateway nodes will need to be told to
offer routing service (w/ '-route').  If any non-gateways are currently
offering it, turn it off (w/ '-noroute').  Otherwise, the -[no]route 
state is left at whatever it was.

After you do all this, you may have some problems with your llbd, glbd, or
rgyd processes.  Hopefully, the nodes running glbd and rgyd will be correct.
Otherwise, you will probably want to reboot them, and you may need to 
remove all but one glbd and rgyd, and then create the replicas again.

-- jt --
John Thompson
Honeywell, SSEC
Plymouth, MN  55441
thompson@pan.ssec.honeywell.com

Me?  Represent Honeywell?  You've GOT to be kidding!!!

hooft@prl.philips.nl (Peter van Hooft) (04/07/91)

|In <50cad431.cb12@dabo.citi.umich.edu> rees@dabo.citi.umich.edu (Jim Rees) writes:
|
|>In article <885@imec.UUCP>, fleureck@imec.be (Marc Fleureck) writes:
|
|>  On Domain/OS SR10.2 some machines seem to have a problem with the file 
|>  /sys/node_data/hint_file.  For example, if I log out ( under DM ) on the 
      .
      .
|In general, you need to remove all hint files and reboot all nodes whenever
|a node moves from one network to another, or net numbers change.

No. When you move a machine to another net, you only have to ctnode -root 
it, or "replace" the entry in the naming server database (using edns) with 
the new netid instead of the old one.

If I _had to_ change net numbers, I would do it like this:
1. IF I had an alternate gateway, I would set it up to route for the old netid
   ELSE I would shut down the ns_helpers, glbds, and rgyds in the net to change,
   (so they would disappear out of the replica lists) and shut down all machines 
   except the gateway.
2. edit the /etc/rc file on the gateway to reflect its new netid
3. use netsvc and rtsvc to change its netid, "replace" its entry in the naming 
   server in the other networks (using edns)
4. "init -from" a ns_helper, "create -from" a glbd on the gateway. 
5. IF I had set up an alternate gateway I would at this point one by one change
   the netid of the other machines using netsvc, "replace"-ing their entries in
   the naming server as I went along. Lastly I would switch off the routing on 
   the alternate gateway, then I would netsvc it.
   ELSE I would boot the other machines one at a time (they would pick up the
   netid of the gateway).
6. after all this I would restore the ns_helpers, glbds and rgyds to their 
   original location, shutting down the temporary ones on the gateway.

This shows you'd better plan _far_ ahead choosing a unique netid :-).
Comments anyone on this procedure ?

|Things become slow because when Domain/OS has to try a remote network to
|locate an object, it doesn't get a nack as it does on the local network.
|Instead it has to time out.  If it has to time out several remote networks,
|it can take a long time.

It should only have to time out when the node is not reachable (remote node 
down, remote network down, naming server info incorrect and, yes, hint_file 
corrupt).

|The hint file code is old and crufty and probably could use some work.

|If you find that you have to remove the hint file at times when you haven't
|changed your network topology, then something else may be wrong.  Check to
|see that your nodes are cataloged correctly, with the right net id, in the
|network root name server.

And, first of all, at all sr* releases check that you have an ns_helper running
in each network, and at sr10* that you have "at least one" glbd running in each
network.

Peter van Hooft    Philips Research Labs, Eindhoven, the Netherlands
Email: hooft@prl.philips.nl SERI: HOOFT:NLWAYA01 Voice: +31 40744327 
X400:  /PN=PJG.VanHooft/O=research/PRMD=philips400/ADMD=400net/C=nl/

rees@dabo.citi.umich.edu (Jim Rees) (04/08/91)

In article <2679@prles2.prl.philips.nl>, hooft@prl.philips.nl (Peter van Hooft) writes:

  |In general, you need to remove all hint files and reboot all nodes whenever
  |a node moves from one network to another, or net numbers change.
  
  No. When you move a machine to another net, you only have to ctnode -root 
  it, or "replace" the entry in the naming server database (using edns) with 
  the new netid instead of the old one.

I hate to contradict you, but as of sr9.7 you certainly had to remove all
hint files and reboot all nodes if any changed net ids.  I know this because
I did some work on the hint file code myself at that time.  I don't think
this has changed since then, but I could be wrong.  I haven't tried it.

  If I _had to_ change net numbers, I would do it like this:
  [ complicated 6 step procedure deleted ]

I always use my IP address as my Domain net address.  This worked fine until
the University helpfully decided to change all our IP addresses.

ced@apollo.HP.COM (Carl Davidson) (04/10/91)

From article <50cad431.cb12@dabo.citi.umich.edu>, by rees@dabo.citi.umich.edu (Jim Rees):
> 
> When you want to open a file, Domain/OS first translates the name into a uid
> (like a dev/inode).  Then it has to locate the node holding that object.  In
> the old, old days, that was easy, because the uid contains the node id of
> the node on which it was generated.
> 

It still does.

>
> [deleted description of how uids are used to locate files]
>
> Then internets came along, and it got way harder.  Now the hint file had to
> be changed to hold network numbers as well as node ids.  The bad news is
> that the hint file doesn't get flushed when it should, and stale information
> is sometimes used after a floppy changes nodes or a node changes networks.
> In general, you need to remove all hint files and reboot all nodes whenever
> a node moves from one network to another, or net numbers change.
>

The real culprit here, at least as of SR10.*, is the VM system, which 
caches Internet addresses in the object tables. Although the hint file
is updated by recataloging the node, the object tables in memory on the 
nodes which *weren't* moved still retain the old Internet address of the 
node and name resolution relies in part on these tables. The only way to 
purge the VM tables was to reboot the node, which, as Jim rightly indicates,
means potentially having to reboot every other node in an Internet. Not 
a pretty thought (we have over 2000 nodes here in Chelmsford). This has 
been changed in SR10.4, which will be released later this year (apply 
the usual disclaimers here). 

At SR10.4, the VM system is smart enough to time out stale Internet addresses
in its tables. This should make moving nodes within an Internet 
easier. Also, SR10.4 automatically creates a new hint file when you reboot
a machine, so stale hints are even less likely to hang around.
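
The post does not say how SR10.4 decides an address is stale. As a sketch
only, assuming a simple age limit per entry (the class name, the 60 second
limit and the injected clock are all hypothetical), timing out cached
addresses could look like this:

```python
import time


class ExpiringAddressCache:
    """Toy cache of node id -> net id whose entries expire after ttl_s seconds."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self._ttl = ttl_s
        self._clock = clock          # injectable so the example needs no sleep
        self._entries = {}           # node_id -> (net_id, time stored)

    def store(self, node_id, net_id):
        self._entries[node_id] = (net_id, self._clock())

    def lookup(self, node_id):
        entry = self._entries.get(node_id)
        if entry is None:
            return None
        net_id, stored_at = entry
        if self._clock() - stored_at > self._ttl:
            # Too old to trust: drop it so the caller must look it up again.
            del self._entries[node_id]
            return None
        return net_id


now = [0.0]                                   # fake clock for the example
addr_cache = ExpiringAddressCache(ttl_s=60.0, clock=lambda: now[0])
addr_cache.store(0x1A2B, net_id=1)
fresh = addr_cache.lookup(0x1A2B)             # still within the age limit
now[0] = 61.0                                 # pretend 61 seconds have passed
stale = addr_cache.lookup(0x1A2B)             # expired, so nothing is returned
print(fresh, stale)
```

The point is only that an expired entry forces a fresh lookup instead of
being trusted until the next reboot.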

> 
> Things become slow because when Domain/OS has to try a remote network to
> locate an object, it doesn't get a nack as it does on the local network.
> Instead it has to time out.  If it has to time out several remote networks,
> it can take a long time.
> 

Some of the timeout code was revised at SR10.4 also. Hopefully, it will
help some.

>
> The hint file code is old and crufty and probably could use some work.
> 

As of SR10.3 the hint manager keeps hints in "lru" order and is less
likely to throw away good hints and keep bad ones. Also, a lot of what
gets blamed on the hint manager is really caused by other parts of the
system. I know this because I used to blame a lot of what got fixed in
the VM subsystem on the hint file until looking closely at the problem.
(No, I'm not the person who fixed the VM system, but I am the one who 
complained until it was fixed :-))
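
For readers unfamiliar with "lru" order, here is a small generic sketch of
a least-recently-used table. It is not the hint manager's code; the class
name, capacity and keys are made up:

```python
from collections import OrderedDict


class LruHints:
    """Toy fixed-size hint table that evicts the least recently used entry."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._hints = OrderedDict()   # oldest use first, newest use last

    def get(self, uid):
        if uid not in self._hints:
            return None
        self._hints.move_to_end(uid)  # a hit makes this the newest entry
        return self._hints[uid]

    def put(self, uid, location):
        self._hints[uid] = location
        self._hints.move_to_end(uid)
        if len(self._hints) > self._capacity:
            self._hints.popitem(last=False)   # drop the least recently used


hints = LruHints(capacity=2)
hints.put("uid_a", "node_1")
hints.put("uid_b", "node_2")
hints.get("uid_a")                # uid_a is now more recently used than uid_b
hints.put("uid_c", "node_3")      # table is full, so uid_b is evicted
print(hints.get("uid_a"), hints.get("uid_b"), hints.get("uid_c"))
```

A hint that keeps being used stays in the table, while one that has not
been touched for a while is the first to go.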

 
Carl Davidson  (508) 256-6600 x4361    | 
The Apollo Systems Division of         |  It doesn't TAKE all kinds,
The Hewlett-Packard Company            |  there just ARE  all kinds.
DOMAIN: ced@apollo.HP.COM              |