derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") (12/11/89)
After spending my weekend glued to the tube, I finally found the reason
some of our applications grind to a thrashing halt.
Near as I can tell, there is a bug (?) in memory allocation at SR10
which causes a node to thrash violently even if the application
very clearly should not thrash.
For example, I wrote a program which allocated (via a Pascal NEW)
memory in 1Kbyte chunks. Each time it allocated a chunk, the
program touched each memory location in the chunk.
At SR9.7 on a DN4000 with 16 MB Ram, and 80 MB free disk, I was able
to allocate 40 MB of memory without much problem. There was paging
to disk, but no more than I expected.
At SR 10.1.1 on a DN4000 with 32 MB Ram, and 160 MB free disk, at
around 16 MB of allocated memory the allocation process began
thrashing badly. I was unable in over an hour to get more than
20 MB allocated. (The SR9 version of the program was completely
done in less than 5 minutes.)
There is no way 16 MB of data should thrash a 32 MB node! Aside
from anything else, each data item is only touched once and
should be paged to disk (never to be paged back) when needed -
and this is what we saw on 9.7. From a DPAT we did, the problem
appears to be in the memory allocation routines.
I can post the sample program if anyone is interested, but I've got
a couple of questions.
1) Hasn't anyone else seen this? (Incidentally, we verified the
same behavior on the DN10000 - This could easily account
for the FORTRAN compilation cited by Hanche-Olsen if the
FORTRAN compiler does a lot of dynamic memory allocation
for symbol tables or whatever).
2) Does anyone know what's broken? Is Apollo aware of this problem?
3) How do I work around this without re-writing zillions of lines of
code?
Dave Erstad
Honeywell SSEC
DERSTAD@cim-vax.honeywell.com
derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") (12/12/89)
wescott@lnic5.hprc.uh.edu writes:
> I think the problem you are encountering is a known one.
> Although I was not around at SR 9.7, I understand that
> the allocation of virtual memory was quite different
> than the UNIX-like scheme employed at SR 10. As you may
> know, a code running at SR 10 opens a paging file
> immediately at run time, while SR 9 only opens this upon
> demand. This does not mean that the SR 10 node pages
> immediately...but you know what I mean. Anyhow, there
> has been alot of confusion and disgust voiced from SR 9
> users that wrote their codes to fit the O/S characteristics.
Yes, that is an area of concern; however, it is a different issue.
The fact that I touch (assign a value into) each memory location once
removes any differences due to this effect (Since at SR9 I am demanding
each page). In any case, this should not cause thrashing, just a
uniform increase in disk accesses.
BTW, I was one of the gripers about this change from SR9 since I did
have to rewrite some code and wasn't happy about it.
I found there is a patch on the November patch tape that sounds
suspiciously like this problem but only applies to SAU2, 3, and 5
machines. I haven't gotten any word back from the Apollo
hotline yet.
Dave Erstad
Honeywell SSEC
DERSTAD@cim-vax.honeywell.com
derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") (12/12/89)
I've received a request for the test program which
demonstrates the alleged memory allocation bug in SR10.1.
The program is given below, along with some results. The point
at which SR10 breaks down appears to be somewhat variable, and
I haven't tried to find out what causes that variability. It always
breaks, though. We've verified this at SR10.1, SR10.1.1, and SR10.1.p.
BTW, the results below are with -opt 0 to prevent anything from
being optimized away.
SR9.7.0.4                             SR10.1.1
DN4000 16.0 MB memory                 DN4500 32.0 MB memory
170MB disk                            380 MB FA disk
Alloc speed - alloc  1    7.4         Alloc speed - alloc  1    1.5
Alloc speed - alloc  2   12.8         Alloc speed - alloc  2    2.5
Alloc speed - alloc  3   10.3         Alloc speed - alloc  3    3.4
Alloc speed - alloc  4    9.8         Alloc speed - alloc  4    3.0
Alloc speed - alloc  5    7.4         Alloc speed - alloc  5    3.0
Alloc speed - alloc  6    7.4         Alloc speed - alloc  6    3.4
Alloc speed - alloc  7    8.4         Alloc speed - alloc  7    3.9
Alloc speed - alloc  8   10.8         Alloc speed - alloc  8    4.4
Alloc speed - alloc  9    9.8         Alloc speed - alloc  9    4.4
Alloc speed - alloc 10    7.9         Alloc speed - alloc 10    5.4
Alloc speed - alloc 11   10.8         Alloc speed - alloc 11   51.7
Alloc speed - alloc 12   10.3         ( The next one is > 600 - I killed it )
Alloc speed - alloc 13    8.4
Alloc speed - alloc 14   12.3
Alloc speed - alloc 15    9.4
Alloc speed - alloc 16    9.8
Alloc speed - alloc 17    9.8
Alloc speed - alloc 18    8.4
Alloc speed - alloc 19    8.9
Alloc speed - alloc 20    8.9
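A note on reading the figures above: going by the writeln in the posted program, "Alloc speed" is 1000*(r2 - r1)/size, i.e. elapsed wall-clock milliseconds per 1 KB chunk for one pass. The small C sketch below converts a table entry back into elapsed seconds per pass and into cumulative memory allocated. The function names are hypothetical, and it assumes the runs used size = 2000 as in the posted source, which the thread does not confirm.

```c
#include <assert.h>
#include <stdio.h>

/* Elapsed seconds for one pass, given the reported milliseconds per chunk.
   Assumption: 'chunks' is the 'size' constant of the posted program. */
double pass_seconds(double ms_per_chunk, long chunks)
{
    assert(chunks > 0);
    return ms_per_chunk * (double)chunks / 1000.0;
}

/* Kilobytes allocated after 'passes' complete passes of 'chunks' 1 KB blocks. */
long cumulative_kb(int passes, long chunks)
{
    return (long)passes * chunks;
}

/* Print both figures for one table entry. */
void describe_pass(int pass, double ms_per_chunk, long chunks)
{
    printf("pass %2d: %.1f s elapsed, %ld KB allocated so far\n",
           pass, pass_seconds(ms_per_chunk, chunks),
           cumulative_kb(pass, chunks));
}
```

Under that assumption, describe_pass(11, 51.7, 2000) corresponds to a pass of well over a minute with about 22000 KB allocated, against a few seconds per pass for the earlier SR10.1.1 entries.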
Dave Erstad
DERSTAD@cim-vax.honeywell.com
--------------------------------------------------------------------
program test;
%nolist;
%include '/sys/ins/base.ins.pas';
%include '/sys/ins/cal.ins.pas';
%list;
type
    onek = array[1..256] of integer32;   { 256 x 4 bytes = 1 KB }
const
    size = 2000 { kb};
    max_alloc = 20;
var
    ar : array[1..size] of ^onek;
    num_alloc : integer;
(*************************************************************)
(*                                                           *)
(* Real_seconds Function Integer32                           *)
(*                                                           *)
(* Determines the number of real-time seconds since          *)
(* midnight.                                                 *)
(*                                                           *)
(*************************************************************)
function real_seconds : integer32;
var
    temp_rec : cal_$timedate_rec_t;
begin (* real_seconds *)
    cal_$decode_local_time(temp_rec);
    with temp_rec do
        real_seconds := hour * 3600 + minute * 60 + second;
end (* real_seconds *);
(*************************************************************)
(*                                                           *)
(* Alloc Procedure                                           *)
(*                                                           *)
(* Allocate (size) items, each about 1k byte. Touch          *)
(* each data item.                                           *)
(*                                                           *)
(*************************************************************)
procedure alloc;
var
    i, j, r1, r2 : integer32;
begin
    r1 := real_seconds;
    for i := 1 to size do
    begin
        new(ar[i]);
        for j := 1 to 256 do
            ar[i]^[j] := 0;
    end;
    r2 := real_seconds;
    { elapsed milliseconds per 1 KB chunk for this pass }
    writeln('Alloc speed - alloc ', num_alloc:2, 1000*(r2 - r1) / size:8:1);
end;
begin
    for num_alloc := 1 to max_alloc do
        alloc;
end.
-------------------------------------------------------------------------------
derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") (12/13/89)
Thanks to all who responded to the memory allocation problem.
Here's a summary of what's going on, which is a combination
of contributed info and further experiments we've done.
1) There is a problem. It is a big problem for applications with
lots of memory use.
2) The problem is only with PASCAL dynamic memory allocation.
3) The problem goes away at 10.2.
4) At least one of the patches on the November patch tape
significantly moves the point at which the problem
occurs for DN4500. Whichever patch it is, it doesn't
help DN3000 machines.
5) The workaround given by Adam Matusiak (use malloc in place
of new) does solve the problem.
BTW, Apollo has no plans to generate a patch even though
they have known about this for months. Their solution
is to load 10.2, which doesn't help those of us with
substantial quantities of third party software which
isn't supported under 10.2.
There should be a better way for Apollo to communicate
known bugs of this magnitude to users. Even if they can't
generate a patch, putting a note about this bug in the
patch tape release notes would help.
A mention in the 10.2 release notes would have been
nice as well.
Dave Erstad
Honeywell SSEC
DERSTAD@cim-vax.honeywell.com
derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") (12/13/89)
One last note: To use Adam's workaround on an SR9 compile, change "c_param" to "val_param, d0_return". They are equivalent but the former is not supported at SR9. Of course, you only need to do this if you need code that runs at both SR9 and 10 (as we do).
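Adam's actual workaround (Pascal calling malloc through a c_param declaration) is not reproduced in this thread. As a rough illustration of the idea only, here is a minimal C sketch of the same allocate-and-touch loop built directly on malloc. The function names and the caller-supplied slot array are hypothetical, and the timing uses whole seconds from time(), much like the posted Pascal test.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define CHUNK_WORDS 256  /* 256 x 32-bit words = 1 KB, like the Pascal 'onek' type */

/* Allocate 'chunks' 1 KB blocks with malloc and write every word once.
   Pointers are stored in 'slots'; nothing is freed, mirroring the original test.
   Returns the number of blocks actually allocated (less than 'chunks' if
   malloc fails). */
size_t alloc_and_touch(int32_t **slots, size_t chunks)
{
    size_t i, j;

    assert(slots != NULL);
    for (i = 0; i < chunks; i++) {
        slots[i] = malloc(CHUNK_WORDS * sizeof(int32_t));
        if (slots[i] == NULL)
            return i;
        for (j = 0; j < CHUNK_WORDS; j++)
            slots[i][j] = 0;          /* touch every location in the chunk */
    }
    return chunks;
}

/* Run one pass and return elapsed milliseconds per allocated chunk
   (0.0 if nothing was allocated). '*done' receives the chunk count. */
double timed_pass_ms_per_chunk(int32_t **slots, size_t chunks, size_t *done)
{
    time_t start = time(NULL);

    *done = alloc_and_touch(slots, chunks);
    time_t end = time(NULL);
    if (*done == 0)
        return 0.0;
    return 1000.0 * difftime(end, start) / (double)*done;
}
```

To mimic the posted program, one would call timed_pass_ms_per_chunk 20 times with a 2000-entry slot array and print the result of each pass. Whether this reproduces or avoids the SR10 behaviour on any particular node is not something the sketch can claim.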
oj@apollo.HP.COM (Ellis Oliver Jones) (12/13/89)
In article <8912111825.AA06689@umix.cc.umich.edu> derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") writes:
>
>I found there is a patch on the November patch tape that sounds
>suspiciously like this problem but only applies to SAU2, 3, and 5
>machines. I haven't gotten any word back from the Apollo
>hotline yet.
Hmm, either your definition of the term "November Patch Tape" is different
from ours, or you have a strange tape. As far as I can tell, there
wasn't a Prism November patch tape, and the Moto tape contained
only patches related to the X Window System.
/oj (speaking for myself, not necessarily for HP Apollo)
derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") (12/14/89)
> >I found there is a patch on the November patch tape that sounds
> >suspiciously like this problem but only applies to SAU2, 3, and 5
> >machines. I haven't gotten any word back from the Apollo
> >hotline yet.
>
> Hmm, either your definition of the term "November Patch Tape" is different
> from ours, or you have a strange tape. As far as I can tell, there
> wasn't a Prism November patch tape, and the Moto tape contained
> only patches related to the X Window System.
The patch tape had release notes labeled
Patch_M68K_8911 Release Notes
There were many non-X patches (over half, I'm sure). The bulk seemed
to fix sau7/DN4500 problems.
The patch in question was unrelated to the problem, however.
Dave Erstad
Honeywell SSEC
DERSTAD@cim-vax.honeywell.com
oj@apollo.HP.COM (Ellis Oliver Jones) (12/19/89)
In article <8912141641.AA09668@umix.cc.umich.edu> Dave Erstad took issue with my posting:
>> Hmm, either your definition of the term "November Patch Tape" is different
>> from ours, or you have a strange tape. As far as I can tell, there
>> wasn't a Prism November patch tape, and the Moto tape contained
>> only patches related to the X Window System.
Dave wrote:
>There were many non-X patches (over half, I'm sure). The bulk seemed
>to fix sau7/DN4500 problems.
He's right. The patch tapes are cumulative, of course. I feel
unnaturally possessive about the November tape because I prepared the
only two NEW patches on it. Sorry for the wrong info.
/Ollie Jones.