[comp.sys.apollo] memory allocation bug

derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") (12/11/89)

After spending my weekend glued to the tube, I finally found the reason
some of our applications grind to a thrashing halt.

Near as I can tell, there is a bug (?) in memory allocation at SR10
which causes a node to thrash violently even if the application
very clearly should not thrash.

For example, I wrote a program which allocated (via a Pascal NEW)
memory in 1Kbyte chunks.  Each time it allocated a chunk, the 
program touched each memory location in the chunk.

At SR9.7 on a DN4000 with 16 MB RAM and 80 MB free disk, I was able
to allocate 40 MB of memory without much problem.  There was paging
to disk, but no more than I expected.

At SR10.1.1 on a DN4000 with 32 MB RAM and 160 MB free disk, the
allocation process began thrashing badly at around 16 MB of
allocated memory.  In over an hour I was unable to get more than
20 MB allocated.  (The SR9 version of the program was completely
done in less than 5 minutes.)

There is no way 16 MB of data should thrash a 32 MB node!  Aside
from anything else, each data item is only touched once and
should be paged to disk (never to be paged back) when needed -
and this is what we saw on 9.7.  From a DPAT we did, the problem
appears to be in the memory allocation routines.

I can post the sample program if anyone is interested, but I've got
a couple of questions.
   1)  Hasn't anyone else seen this?  (Incidentally, we verified the 
       same behavior on the DN10000 - This could easily account
       for the FORTRAN compilation cited by Hanche-Olsen if the
       FORTRAN compiler does a lot of dynamic memory allocation
       for symbol tables or whatever).
   2)  Does anyone know what's broken?  Is Apollo aware of this problem?
   3)  How do I work around this without re-writing zillions of lines of 
       code?

Dave Erstad
Honeywell SSEC
DERSTAD@cim-vax.honeywell.com

derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") (12/12/89)

wescott@lnic5.hprc.uh.edu writes:

>  I think the problem you are encountering is a known one.
>  Although I was not around at SR 9.7, I understand that
>  the allocation of virtual memory was quite different
>  than the UNIX-like scheme employed at SR 10.  As you may
>  know, a code running at SR 10 opens a paging file
>  immediately at run time, while SR 9 only opens this upon
>  demand.  This does not mean that the SR 10 node pages
>  immediately...but you know what I mean.  Anyhow, there
>  has been a lot of confusion and disgust voiced from SR 9
>  users that wrote their codes to fit the O/S characteristics.

Yes, that is an area of concern;  however, it is a different
issue.  The fact that I touch (assign a value into) each memory
location once removes any differences due to this effect (Since
at SR9 I am demanding each page).  In any case, this should not 
cause thrashing, just a uniform increase in disk accesses. 

BTW, I was one of the gripers about this change from SR9 since
I did have to rewrite some code and wasn't happy about it.

I found there is a patch on the November patch tape that sounds
suspiciously like this problem but only applies to SAU2, 3, and 5
machines.  I haven't gotten any word back from the Apollo
hotline yet.

Dave Erstad
Honeywell SSEC
DERSTAD@cim-vax.honeywell.com

derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") (12/12/89)

I've received a request for the test program which
demonstrates the alleged memory allocation bug in SR10.1.

The program is given below, along with some results.  The point
at which SR10 breaks down appears to be somewhat variable, and
I haven't tried to find out what causes that variability.  It always
breaks, though.  We've verified this at SR10.1, SR10.1.1, and SR10.1.p.
BTW, the results below are with -opt 0 to prevent anything from 
being optimized away.


   SR9.7.0.4                           SR10.1.1
   DN4000 16.0 MB memory               DN4500  32.0 MB memory
   170 MB disk                         380 MB FA disk
   (ms per 1 KB chunk, per pass)       (ms per 1 KB chunk, per pass)

Alloc speed - alloc  1     7.4         Alloc speed - alloc  1     1.5
Alloc speed - alloc  2    12.8         Alloc speed - alloc  2     2.5
Alloc speed - alloc  3    10.3         Alloc speed - alloc  3     3.4
Alloc speed - alloc  4     9.8         Alloc speed - alloc  4     3.0
Alloc speed - alloc  5     7.4         Alloc speed - alloc  5     3.0
Alloc speed - alloc  6     7.4         Alloc speed - alloc  6     3.4
Alloc speed - alloc  7     8.4         Alloc speed - alloc  7     3.9
Alloc speed - alloc  8    10.8         Alloc speed - alloc  8     4.4
Alloc speed - alloc  9     9.8         Alloc speed - alloc  9     4.4
Alloc speed - alloc 10     7.9         Alloc speed - alloc 10     5.4
Alloc speed - alloc 11    10.8         Alloc speed - alloc 11    51.7
Alloc speed - alloc 12    10.3         ( The next one is > 600 - I killed it )
Alloc speed - alloc 13     8.4     
Alloc speed - alloc 14    12.3
Alloc speed - alloc 15     9.4
Alloc speed - alloc 16     9.8
Alloc speed - alloc 17     9.8
Alloc speed - alloc 18     8.4
Alloc speed - alloc 19     8.9
Alloc speed - alloc 20     8.9


Dave Erstad
DERSTAD@cim-vax.honeywell.com
--------------------------------------------------------------------

program test;

%nolist;
%include '/sys/ins/base.ins.pas';
%include '/sys/ins/cal.ins.pas';
%list;

type
   onek = array[1..256] of integer32;

const
   size = 2000 { kb};
   max_alloc = 20;

var
   ar : array[1..size] of ^onek;

   num_alloc : integer;

(*************************************************************)
(*                                                           *)
(*  Real_seconds                      Function  Integer32    *)
(*                                                           *)
(*  Determines the number of real-time seconds since         *)
(*  midnight.                                                *)
(*                                                           *)
(*************************************************************)
function real_seconds : integer32;

var
   temp_rec
      :  cal_$timedate_rec_t;

begin (* real_seconds *)

   cal_$decode_local_time(temp_rec);
   with temp_rec do
      real_seconds := hour * 3600+minute*60+second;

end (* real_seconds *);
      
(*************************************************************)
(*                                                           *)
(*  Alloc                               Procedure            *)
(*                                                           *)
(*  Allocate (size) items, each about 1k byte.  Touch        *)
(*  each data item.                                          *)
(*                                                           *)
(*************************************************************)
procedure alloc;   

var
   i, j, r1, r2 : integer32;

begin

   r1 := real_seconds;

   for i := 1 to size do
      begin
         new(ar[i]);
         for j := 1 to 256 do
           ar[i]^[j] := 0;
      end;

   r2 := real_seconds;

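   { Report milliseconds per chunk.  real_seconds counts from midnight, }
   { so a pass that spans midnight gives a negative time.               }
   writeln('Alloc speed - alloc ', num_alloc:2, 1000*(r2 - r1) / size:8:1);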
end;


begin

   for num_alloc := 1 to max_alloc do
      alloc;

end.

-------------------------------------------------------------------------------

derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") (12/13/89)

Thanks to all who responded to the memory allocation problem.
Here's a summary of what's going on, which is a combination
of contributed info and further experiments we've done.

1)  There is a problem.  It is a big problem for applications with
    lots of memory use.

2)  The problem is only with PASCAL dynamic memory allocation.

3)  The problem goes away at 10.2.

4)  At least one of the patches on the November patch tape
    significantly moves the point at which the problem
    occurs for DN4500.  Whichever patch it is, it doesn't
    help DN3000 machines.

5)  The workaround given by Adam Matusiak (use malloc in place
    of new) does solve the problem.
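
To illustrate item 5, here is a minimal sketch, in C, of the same
allocate-and-touch test calling malloc directly.  It is not Adam's
code, which was not posted in this thread; the names alloc_pass and
run_passes are made up for illustration, and the sketch only shows
the shape of the loop, not that any particular node stops thrashing.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define CHUNK_WORDS 256   /* 256 four-byte words = 1 KB, like "onek" */
#define SIZE        2000  /* chunks per pass, like "size"            */

int32_t *ar[SIZE];        /* pointers from the most recent pass      */

/* One pass: malloc SIZE chunks and write every word once.
   Returns the number of chunks obtained (SIZE on success). */
int alloc_pass(void)
{
    int i, j;

    assert(CHUNK_WORDS * sizeof(int32_t) == 1024);

    for (i = 0; i < SIZE; i++) {
        ar[i] = malloc(CHUNK_WORDS * sizeof(int32_t));
        if (ar[i] == NULL)
            return i;                 /* out of memory: stop early */
        for (j = 0; j < CHUNK_WORDS; j++)
            ar[i][j] = 0;             /* touch each word */
    }
    return SIZE;
}

/* Run the given number of passes, printing milliseconds per chunk
   in the same layout as the Pascal test.  Chunks from earlier
   passes are deliberately never freed, as in the original. */
void run_passes(int passes)
{
    int n;

    for (n = 1; n <= passes; n++) {
        time_t t1 = time(NULL);
        int got = alloc_pass();
        time_t t2 = time(NULL);

        if (got < SIZE) {
            printf("malloc failed in pass %d after %d chunks\n", n, got);
            return;
        }
        printf("Alloc speed - alloc %2d%8.1f\n",
               n, 1000.0 * difftime(t2, t1) / SIZE);
    }
}
```

Calling run_passes(20) from main allocates the same 20 x 2000 KB
(about 40 MB) total as the Pascal test.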

BTW, Apollo has no plans to generate a patch even though
they have known about this for months.  Their solution
is to load 10.2, which doesn't help those of us with
substantial quantities of third party software which
isn't supported under 10.2.

There should be a better way for Apollo to communicate
known bugs of this magnitude to users.  Even if they can't 
generate a patch,  putting a note about this bug in the 
patch tape release notes would help.  

A mention in the 10.2 release notes would have been 
nice as well.

Dave Erstad
Honeywell SSEC
DERSTAD@cim-vax.honeywell.com

derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") (12/13/89)

One last note:  To use Adam's workaround on an SR9 compile, change
"c_param" to "val_param, d0_return".  They are equivalent but the
former is not supported at SR9.

Of course, you only need to do this if you need code that runs at 
both SR9 and 10 (as we do).

oj@apollo.HP.COM (Ellis Oliver Jones) (12/13/89)

In article <8912111825.AA06689@umix.cc.umich.edu> derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") writes:
>
>I found there is a patch on the November patch tape that sounds
>suspiciously like this problem but only applies to SAU2, 3, and 5
>machines.  I haven't gotten any word back from the Apollo
>hotline yet.

Hmm, either your definition of the term "November Patch Tape" is different
from ours, or you have a strange tape.  As far as I can tell, there
wasn't a Prism November patch tape, and the Moto tape contained
only patches related to the X Window System.
/oj (speaking for myself, not necessarily for HP Apollo)

derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") (12/14/89)

>   >I found there is a patch on the November patch tape that sounds
>   >suspiciously like this problem but only applies to SAU2, 3, and 5
>   >machines.  I haven't gotten any word back from the Apollo
>   >hotline yet.
>   
>   Hmm, either your definition of the term "November Patch Tape" is different
>   from ours, or you have a strange tape.  As far as I can tell, there
>   wasn't a Prism November patch tape, and the Moto tape contained
>   only patches related to the X Window System.

The patch tape had release notes labeled

                         Patch_M68K_8911 Release Notes

There were many non-X patches (over half, I'm sure).  The bulk seemed
to fix sau7/DN4500 problems.

The patch in question was unrelated to the problem, however.

Dave Erstad
Honeywell SSEC
DERSTAD@cim-vax.honeywell.com

oj@apollo.HP.COM (Ellis Oliver Jones) (12/19/89)

In article <8912141641.AA09668@umix.cc.umich.edu> Dave Erstad took issue with my posting:

>>   Hmm, either your definition of the term "November Patch Tape" is different
>>   from ours, or you have a strange tape.  As far as I can tell, there
>>   wasn't a Prism November patch tape, and the Moto tape contained
>>   only patches related to the X Window System.

Dave wrote:
>There were many non-X patches (over half, I'm sure).  The bulk seemed
>to fix sau7/DN4500 problems.

He's right.  The patch tapes are cumulative, of course.   I feel
unnaturally possessive about the November tape because I prepared the
only two NEW patches on it.  Sorry for the wrong info.
/Ollie Jones.