robert@spam.istc.sri.com (08/10/88)
In recent months I have been engaged in porting several applications,
which use Remote Procedure Calls (RPCs) for Inter-Process Communication
(IPC), from a Sun hardware base to an HP hardware base. Needless to say,
there have been major problems. One particularly vexing problem has been
with a client which seemed to hang in the RPC code during a call to
the server. Both the client and server were on the same host,
an HP9000/300 running HP/UX 6.01, and both worked fine on a Sun. Since
HP ported the Sun RPC code, the port of the applications should have been
very straightforward.
Yesterday, after weeks of fighting this problem, we apparently
discovered the culprit.
Statement of problem:
When the RPC client executes several successive calls to the RPC server,
the RPC client hangs in the RPC code, apparently in an infinite loop
inside the HP function, _rpc_malloc(). If we run the client in the
debugger, HP's proprietary cdb, we can interrupt the client when it
hangs, and do a stack trace. This is what it shows:
---------------------------------------------------------
interrupt (ignore) at 0x13e92
(file unknown): _malloc +0x9e: (line unknown)
_malloc+0x9e: mov.l __bufendtab+0x10c,%d0
>t
0 _malloc +0x9e (0x6, 0x6, 0xffeff688, 0xffeff688, 0x10b3c)
1 __rpc_malloc +0x16 (0, 0, 0, 0xffeff740, 0x21cd4)
2 _xdr_bytes +0x8a (0x21cd4, 0xffeff744, 0xffeff748, 0x190, 0xffeff740)
3 _xdr_opaque_auth +0x34 (0x21cd4, 0xffeff740, 0xffeff734, 0x21cd4, 0xffeff6dc)
4 _xdr_accepted_reply +0x1a (0x21cd4, 0xffeff740, 0xffffffff, 0x1, 0x21cd4)
5 _xdr_union +0x36 (0x21cd4, 0xffeff73c, 0xffeff740, 0x1c198, 0)
6 _xdr_replymsg +0x180 (0x21cd4, 0xffeff734, 0x1e, 0, 0xffeffdf1)
7 _clnttcp_create +0x36a (0x21c88, 0x4, 0x7332, 0xffeff7f8, 0x71e6)
8 rpc_call (host = 0x21848, prognum = 100100, versnum = 4, procnum = 4,
inproc = 0x7332, in = 0xffeff7f8, outproc = 0x71e6,
out = 0xffeff804, proto = 6, time = 30) [rpc_call.c: 98]
9 bfill (fqkeyname = 0xffeff838, type = 2) [bfill.c: 98]
10 ns_dump (fqkeyname = 0xffeffc64, fp = 0x1e9b0) [ns_dump.c: 188]
11 dumpcom (dt = 3) [dumpcom.c: 300]
12 main (argc = 1, argv = 0xffeffda4) [ns_tool.c: 210]
>
---------------------------------------------------------
Note that the program is apparently hanging in malloc (also note the
strange syntax of the debugger, which shows every system function as
taking 5 parameters). On further investigation (not shown here), we
`c'ontinued execution and interrupted it several times in succession, and
we found that the address of the code being executed would loop through a
specific set of addresses in malloc(). Since malloc() should be a simple
function, and it seemed to work in other applications we had, we began to
suspect that _rpc_malloc() was calling malloc() repeatedly, and never
returning to the calling routines.
Let's look at the HP file /usr/include/rpc/types.h, where _rpc_malloc(),
and its counterpart _rpc_free(), are macroed into the RPC code:
-------------------------------------------------
/*
*
* (c) Copyright 1987 Hewlett-Packard Company
* (c) Copyright 1984 Sun Microsystems, Inc.
*/
/* some stuff deleted for brevity */
#define mem_alloc(bsize) _rpc_malloc(bsize)
#define mem_free(ptr, bsize) _rpc_free(ptr)
char *_rpc_malloc();
void _rpc_free();
-------------------------------------------------
Now let's look at the same code from a Sun:
-------------------------------------------------
/* @(#)types.h 1.18 87/07/24 SMI */
/* some stuff deleted for brevity */
extern char *malloc();
#define mem_alloc(bsize) malloc(bsize)
#define mem_free(ptr, bsize) free(ptr)
-------------------------------------------------
Notice that the macros mem_alloc and mem_free are different on the
two systems. Because of our suspicions, we wrote our own routines,
_rpc_malloc() and _rpc_free(), which merely called malloc() and free()
respectively, and linked them into the client we were running, thus
bypassing the HP provided _rpc_malloc() and _rpc_free(). Surprise, the
client now can communicate perfectly with the server.
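A minimal sketch of such replacement routines follows. It is an
illustration, not the code that was actually linked in. The parameter
types are assumptions: the HP header declares both functions without
prototypes, so the unsigned size argument and the char * pointer argument
here are guesses modeled on how the Sun macros call malloc() and free().
It is written with ANSI prototypes so that it compiles on current
compilers.

```c
#include <stdlib.h>

/*
 * Stand-in for the vendor _rpc_malloc(): pass the request straight
 * through to malloc(). Returns char * to match the declaration in
 * /usr/include/rpc/types.h. The parameter type is an assumption.
 */
char *_rpc_malloc(unsigned int bsize)
{
    return (char *) malloc(bsize);
}

/*
 * Stand-in for the vendor _rpc_free(): pass the pointer straight
 * through to free(). The mem_free() macro drops its size argument,
 * so only the pointer arrives here.
 */
void _rpc_free(char *ptr)
{
    free(ptr);
}
```

The idea is that these are compiled into an object file and linked
directly into the client, so that the mem_alloc() and mem_free() calls
inside the RPC library resolve to them instead of to the library's own
versions. Whether the linker accepts this without a duplicate-symbol
complaint depends on how the library is packaged.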
This seems to me to be a pretty effective statement that there is a
bug in HP's RPC memory allocation code. We have a call into HP tech
support, but because the normal tech support personnel have to go to
the engineers on a problem like this, the reply from HP is likely to
take a week or more. If there are any HP people who read this and
have access to the source, could you please check the code to _rpc_malloc()
and see what you can see? We would check ourselves, but we don't yet
have a src license (we will soon), and even if we did, the src license
doesn't include all the libraries (not to mention compilers, and other
stuff most normal src licenses include) so we might still be stuck.
If you are a developer having problems with RPC code, you might see if
you are having the same problem we are.
--------------------------------------------------------------------------
Robert Allen,
robert@spam.istc.sri.com,
415-859-2143 (work phone, days)
--------------------------------------------------------------------------