[comp.sys.hp] Probable RPC bug on HP9000/300.

robert@spam.istc.sri.com (08/10/88)
    In recent months I have been engaged in porting several applications,
    which use Remote Procedure Calls (RPCs) for Inter-Process Communication
    (IPC), from a Sun hardware base to an HP hardware base.  Needless to say,
    there have been major problems.  One particularly vexing problem has been
    with an RPC which seemed to hang on execution, in the client, during an
    RPC to the server.  Both the client and server were on the same host,
    an HP9000/300 running HP-UX 6.01, and both worked fine on a Sun.  Since
    HP ported the Sun RPC code, the port of the applications should have been
    very straightforward.

    Yesterday, after weeks of fighting this problem, we apparently
    discovered the culprit.

    Statement of problem:

    When the RPC client executes several successive calls to the RPC server,
    the RPC client hangs in the RPC code, apparently in an infinite loop
    inside the HP function, _rpc_malloc().  If we run the client in the
    debugger, HP's proprietary cdb, we can interrupt the client when it
    hangs, and do a stack trace.  This is what it shows:

---------------------------------------------------------
interrupt (ignore) at 0x13e92
(file unknown): _malloc +0x9e: (line unknown)
_malloc+0x9e:           mov.l   __bufendtab+0x10c,%d0
>t
 0 _malloc +0x9e (0x6, 0x6, 0xffeff688, 0xffeff688, 0x10b3c)
 1 __rpc_malloc +0x16 (0, 0, 0, 0xffeff740, 0x21cd4)
 2 _xdr_bytes +0x8a (0x21cd4, 0xffeff744, 0xffeff748, 0x190, 0xffeff740)
 3 _xdr_opaque_auth +0x34 (0x21cd4, 0xffeff740, 0xffeff734, 0x21cd4, 0xffeff6dc)
 4 _xdr_accepted_reply +0x1a (0x21cd4, 0xffeff740, 0xffffffff, 0x1, 0x21cd4)
 5 _xdr_union +0x36 (0x21cd4, 0xffeff73c, 0xffeff740, 0x1c198, 0)
 6 _xdr_replymsg +0x180 (0x21cd4, 0xffeff734, 0x1e, 0, 0xffeffdf1)
 7 _clnttcp_create +0x36a (0x21c88, 0x4, 0x7332, 0xffeff7f8, 0x71e6)
 8 rpc_call (host = 0x21848, prognum = 100100, versnum = 4, procnum = 4,
	     inproc = 0x7332, in = 0xffeff7f8, outproc = 0x71e6,
	     out = 0xffeff804, proto = 6, time = 30)    [rpc_call.c: 98]
 9 bfill (fqkeyname = 0xffeff838, type = 2)    [bfill.c: 98]
10 ns_dump (fqkeyname = 0xffeffc64, fp = 0x1e9b0)    [ns_dump.c: 188]
11 dumpcom (dt = 3)    [dumpcom.c: 300]
12 main (argc = 1, argv = 0xffeffda4)    [ns_tool.c: 210]
>
---------------------------------------------------------

    Note that the program is apparently hanging in malloc (also note the
    strange syntax of the debugger, which shows every system function as
    taking 5 parameters).  Under further investigation (not shown here), we
    `c'ontinued execution and interrupted it several times in succession, and
    we found that the address of the code being executed would loop through a
    specific set of addresses in malloc().  Since malloc() should be a simple
    function, and it seemed to work in other applications we had, we began to
    suspect that _rpc_malloc() was calling malloc() repeatedly, and never
    returning to the calling routines.

    Let's look at the HP file /usr/include/rpc/types.h, where _rpc_malloc(),
    and its counterpart _rpc_free(), are macroed into the RPC code:

	-------------------------------------------------
	/*
	 *
	 * (c) Copyright 1987 Hewlett-Packard Company
	 * (c) Copyright 1984 Sun Microsystems, Inc.
	 */

		/* some stuff deleted for brevity */

	#define mem_alloc(bsize)	_rpc_malloc(bsize)
	#define mem_free(ptr, bsize)	_rpc_free(ptr)
	char *_rpc_malloc();
	void _rpc_free();
	-------------------------------------------------

    Now let's look at the same code from a Sun:

	-------------------------------------------------
	/*      @(#)types.h 1.18 87/07/24 SMI      */


	    /* some stuff deleted for brevity */

	extern char *malloc();
	#define mem_alloc(bsize)	malloc(bsize)
	#define mem_free(ptr, bsize)	free(ptr)
	-------------------------------------------------

    Notice that the macros mem_alloc and mem_free are different on the
    two systems.  Because of our suspicions, we wrote our own routines,
    _rpc_malloc() and _rpc_free(), which merely called malloc() and free()
    respectively, and linked them into the client we were running, thus
    bypassing the HP provided _rpc_malloc() and _rpc_free().  Surprise, the
    client now can communicate perfectly with the server.
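
    For anyone wanting to try the same thing, the replacement routines
    amount to little more than the sketch below.  The argument types are
    assumptions, since the HP header declares both functions without
    prototypes, and the sketch is written in ANSI C rather than the older
    style of the headers above.  It is not the HP source, which we have
    not seen.

```c
#include <stdlib.h>

/*
 * Stand-ins for the library routines that HP's <rpc/types.h> maps
 * mem_alloc() and mem_free() onto.  The parameter types here are
 * assumed; the HP header gives only "char *_rpc_malloc();" and
 * "void _rpc_free();".
 */
char *_rpc_malloc(unsigned int bsize)
{
    /* Hand the request straight to the C library allocator. */
    return (char *) malloc(bsize);
}

void _rpc_free(char *ptr)
{
    /* mem_free(ptr, bsize) drops bsize, so only the pointer arrives here. */
    free(ptr);
}
```

    Compile this to an object file and link it with the client's other
    objects.  This relies on the linker taking a definition from the
    client's own objects in preference to the copy in the library, which
    is how it behaved for us.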

    This seems to me to be a pretty effective statement that there is a
    bug in HP's RPC memory allocation code.  We have a call into HP tech
    support, but because the normal tech support personnel have to go to
    the engineers on a problem like this, the reply from HP is likely to
    take a week or more.  If there are any HP people who read this and
    have access to the source, could you please check the code to _rpc_malloc()
    and see what you can see?  We would check ourselves, but we don't yet
    have a src license (we will soon), and even if we did, the src license
    doesn't include all the libraries (not to mention compilers, and other
    stuff most normal src licenses include) so we might still be stuck.

    If you are a developer having problems with RPC code, you might check
    whether you are running into the same problem we are.

--------------------------------------------------------------------------
  Robert Allen,
		robert@spam.istc.sri.com,
					    415-859-2143 (work phone, days)
--------------------------------------------------------------------------