derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") (12/11/89)
After spending my weekend glued to the tube, I finally found the reason some of our applications grind to a thrashing halt.

Near as I can tell, there is a bug (?) in memory allocation at SR10 which causes a node to thrash violently even if the application very clearly should not thrash.

For example, I wrote a program which allocated (via a Pascal NEW) memory in 1 Kbyte chunks. Each time it allocated a chunk, the program touched each memory location in the chunk.

At SR9.7 on a DN4000 with 16 MB RAM and 80 MB free disk, I was able to allocate 40 MB of memory without much problem. There was paging to disk, but no more than I expected.

At SR10.1.1 on a DN4000 with 32 MB RAM and 160 MB free disk, at around 16 MB of allocated memory the allocation process began thrashing badly. I was unable in over an hour to get more than 20 MB allocated. (The SR9 version of the program was completely done in less than 5 minutes.)

There is no way 16 MB of data should thrash a 32 MB node! Aside from anything else, each data item is only touched once and should be paged to disk (never to be paged back) when needed - and this is what we saw on 9.7.

From a DPAT we did, the problem appears to be in the memory allocation routines.

I can post the sample program if anyone is interested, but I've got a couple of questions.

1) Hasn't anyone else seen this? (Incidentally, we verified the same behavior on the DN10000. This could easily account for the FORTRAN compilation cited by Hanche-Olsen if the FORTRAN compiler does a lot of dynamic memory allocation for symbol tables or whatever.)

2) Does anyone know what's broken? Is Apollo aware of this problem?

3) How do I work around this without re-writing zillions of lines of code?

Dave Erstad
Honeywell SSEC
DERSTAD@cim-vax.honeywell.com
derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") (12/12/89)
wescott@lnic5.hprc.uh.edu writes:

> I think the problem you are encountering is a known one.
> Although I was not around at SR 9.7, I understand that
> the allocation of virtual memory was quite different
> than the UNIX-like scheme employed at SR 10. As you may
> know, a code running at SR 10 opens a paging file
> immediately at run time, while SR 9 only opens this upon
> demand. This does not mean that the SR 10 node pages
> immediately...but you know what I mean. Anyhow, there
> has been alot of confusion and disgust voiced from SR 9
> users that wrote their codes to fit the O/S characteristics.

Yes, that is an area of concern; however, it is a different issue. The fact that I touch (assign a value into) each memory location once removes any differences due to this effect (since at SR9 I am demanding each page). In any case, this should not cause thrashing, just a uniform increase in disk accesses.

BTW, I was one of the gripers about this change from SR9 since I did have to rewrite some code and wasn't happy about it.

I found there is a patch on the November patch tape that sounds suspiciously like this problem but only applies to SAU2, 3, and 5 machines. I haven't gotten any word back from the Apollo hotline yet.

Dave Erstad
Honeywell SSEC
DERSTAD@cim-vax.honeywell.com
derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") (12/12/89)
I've received a request for the test program which demonstrates the alleged memory allocation bug in SR10.1. The program is given below, along with some results.

The point at which SR10 breaks down appears to be somewhat variable, and I haven't tried to find out what causes that variability. It always breaks, though. We've verified this at SR10.1, SR10.1.1, and SR10.1.p.

BTW, the results below are with -opt 0 to prevent anything from being optimized away.

SR9.7.0.4                          SR10.1.1
DN4000 16.0 MB memory              DN4500 32.0 MB memory
170MB disk                         380 MB FA disk

Alloc speed - alloc  1     7.4     Alloc speed - alloc  1     1.5
Alloc speed - alloc  2    12.8     Alloc speed - alloc  2     2.5
Alloc speed - alloc  3    10.3     Alloc speed - alloc  3     3.4
Alloc speed - alloc  4     9.8     Alloc speed - alloc  4     3.0
Alloc speed - alloc  5     7.4     Alloc speed - alloc  5     3.0
Alloc speed - alloc  6     7.4     Alloc speed - alloc  6     3.4
Alloc speed - alloc  7     8.4     Alloc speed - alloc  7     3.9
Alloc speed - alloc  8    10.8     Alloc speed - alloc  8     4.4
Alloc speed - alloc  9     9.8     Alloc speed - alloc  9     4.4
Alloc speed - alloc 10     7.9     Alloc speed - alloc 10     5.4
Alloc speed - alloc 11    10.8     Alloc speed - alloc 11    51.7
Alloc speed - alloc 12    10.3     ( The next one is > 600 - I killed it )
Alloc speed - alloc 13     8.4
Alloc speed - alloc 14    12.3
Alloc speed - alloc 15     9.4
Alloc speed - alloc 16     9.8
Alloc speed - alloc 17     9.8
Alloc speed - alloc 18     8.4
Alloc speed - alloc 19     8.9
Alloc speed - alloc 20     8.9

Dave Erstad
DERSTAD@cim-vax.honeywell.com

--------------------------------------------------------------------
program test;
%nolist;
%include '/sys/ins/base.ins.pas';
%include '/sys/ins/cal.ins.pas';
%list;

type
    onek = array[1..256] of integer32;

const
    size      = 2000 { kb};
    max_alloc = 20;

var
    ar        : array[1..size] of ^onek;
    num_alloc : integer;

(*************************************************************)
(*                                                           *)
(*  Real_seconds                    Function Integer32       *)
(*                                                           *)
(*  Determines the number of real-time seconds since         *)
(*  midnight.                                                *)
(*                                                           *)
(*************************************************************)
function real_seconds : integer32;
var
    temp_rec : cal_$timedate_rec_t;
begin (* real_seconds *)
    cal_$decode_local_time(temp_rec);
    with temp_rec do
        real_seconds := hour * 3600 + minute * 60 + second;
end (* real_seconds *);

(*************************************************************)
(*                                                           *)
(*  Alloc                           Procedure                *)
(*                                                           *)
(*  Allocate (size) items, each about 1k byte.  Touch        *)
(*  each data item.                                          *)
(*                                                           *)
(*************************************************************)
procedure alloc;
var
    i, j, r1, r2 : integer32;
begin
    r1 := real_seconds;
    for i := 1 to size do
        begin
        new(ar[i]);
        for j := 1 to 256 do
            ar[i]^[j] := 0;
        end;
    r2 := real_seconds;
    writeln('Alloc speed - alloc ', num_alloc:2,
            1000*(r2 - r1) / size:8:1);
end;

begin
    for num_alloc := 1 to max_alloc do
        alloc;
end.
-------------------------------------------------------------------------------
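[Editor's sketch, not part of the original posting.] Going by the writeln in the program above, the "Alloc speed" figure appears to be elapsed wall-clock seconds times 1000 divided by the number of chunks, i.e. roughly milliseconds per 1 KB chunk. A small C rendering of that arithmetic follows. The function names are hypothetical; only the formulas are taken from the posted Pascal.

```c
#include <assert.h>

/* Hypothetical helper names; the formulas mirror the posted Pascal. */

/* Mirrors 1000*(r2 - r1)/size: whole elapsed seconds for one pass,
   converted to milliseconds per chunk. */
double alloc_speed_ms(long elapsed_seconds, long chunks)
{
    assert(chunks > 0);   /* guard against division by zero */
    return 1000.0 * (double)elapsed_seconds / (double)chunks;
}

/* Bytes requested after `passes` calls to alloc, where each call makes
   `chunks` chunks of 256 four-byte integers (the onek type). */
long total_bytes(long passes, long chunks)
{
    return passes * chunks * 256L * 4L;
}
```

With size = 2000 and max_alloc = 20 as posted, total_bytes(20, 2000) comes to 40,960,000 bytes, in line with the roughly 40 MB mentioned in the first post. Since real_seconds only has one-second resolution, the per-chunk figures are coarse.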
derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") (12/13/89)
Thanks to all who responded to the memory allocation problem. Here's a summary of what's going on, which is a combination of contributed info and further experiments we've done.

1) There is a problem. It is a big problem for applications with lots of memory use.

2) The problem is only with Pascal dynamic memory allocation.

3) The problem goes away at 10.2.

4) At least one of the patches on the November patch tape significantly moves the point at which the problem occurs for DN4500. Whichever patch it is, it doesn't help DN3000 machines.

5) The workaround given by Adam Matusiak (use malloc in place of new) does solve the problem.

BTW, Apollo has no plans to generate a patch even though they have known about this for months. Their solution is to load 10.2, which doesn't help those of us with substantial quantities of third party software which isn't supported under 10.2.

There should be a better way for Apollo to communicate known bugs of this magnitude to users. Even if they can't generate a patch, putting a note about this bug in the patch tape release notes would help. A mention in the 10.2 release notes would have been nice as well.

Dave Erstad
Honeywell SSEC
DERSTAD@cim-vax.honeywell.com
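[Editor's sketch, not part of the original posting.] The thread does not include Adam Matusiak's code, so the following is not his workaround. It is only a rough C illustration of the idea in item 5: get each 1 KB chunk from malloc rather than Pascal new, then touch every word as the test program does. The function names and the constant are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define WORDS_PER_CHUNK 256   /* 256 x 32-bit words = 1 KB, like onek */

/* Allocate one chunk with malloc and write every word once.
   Returns NULL if malloc fails. (Hypothetical helper.) */
int32_t *alloc_touched_chunk(void)
{
    int j;
    int32_t *chunk = malloc(WORDS_PER_CHUNK * sizeof *chunk);

    if (chunk == NULL)
        return NULL;
    for (j = 0; j < WORDS_PER_CHUNK; j++)
        chunk[j] = 0;         /* touch each location, as the test does */
    return chunk;
}

/* Fill slots[0..count-1] with touched chunks.
   Returns how many allocations succeeded. (Hypothetical helper.) */
size_t alloc_pass(int32_t **slots, size_t count)
{
    size_t done = 0;

    assert(slots != NULL);
    while (done < count) {
        slots[done] = alloc_touched_chunk();
        if (slots[done] == NULL)
            break;            /* stop at the first failed allocation */
        done++;
    }
    return done;
}
```

A caller would pass an array of pointers, e.g. alloc_pass(slots, 2000) for one pass of the test, and free each chunk when done. How the Pascal side declares the external malloc routine is not shown here because the thread does not give it.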
derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") (12/13/89)
One last note: To use Adam's workaround on an SR9 compile, change "c_param" to "val_param, d0_return". They are equivalent, but the former is not supported at SR9. Of course, you only need to do this if you need code that runs at both SR9 and SR10 (as we do).
oj@apollo.HP.COM (Ellis Oliver Jones) (12/13/89)
In article <8912111825.AA06689@umix.cc.umich.edu> derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") writes:
>
>I found there is a patch on the November patch tape that sounds
>suspiciously like this problem but only applies to SAU2, 3, and 5
>machines. I haven't gotten any word back from the Apollo
>hotline yet.

Hmm, either your definition of the term "November Patch Tape" is different from ours, or you have a strange tape. As far as I can tell, there wasn't a Prism November patch tape, and the Moto tape contained only patches related to the X Window System.

/oj
(speaking for myself, not necessarily for HP Apollo)
derstad@CIM-VAX.HONEYWELL.COM ("DAVE ERSTAD") (12/14/89)
> >I found there is a patch on the November patch tape that sounds
> >suspiciously like this problem but only applies to SAU2, 3, and 5
> >machines. I haven't gotten any word back from the Apollo
> >hotline yet.
>
> Hmm, either your definition of the term "November Patch Tape" is different
> from ours, or you have a strange tape. As far as I can tell, there
> wasn't a Prism November patch tape, and the Moto tape contained
> only patches related to the X Window System.

The patch tape had release notes labeled

    Patch_M68K_8911 Release Notes

There were many non-X patches (over half, I'm sure). The bulk seemed to fix sau7/DN4500 problems.

The patch in question was unrelated to the problem, however.

Dave Erstad
Honeywell SSEC
DERSTAD@cim-vax.honeywell.com
oj@apollo.HP.COM (Ellis Oliver Jones) (12/19/89)
In article <8912141641.AA09668@umix.cc.umich.edu> Dave Erstad took issue with my posting:
>> Hmm, either your definition of the term "November Patch Tape" is different
>> from ours, or you have a strange tape. As far as I can tell, there
>> wasn't a Prism November patch tape, and the Moto tape contained
>> only patches related to the X Window System.

Dave wrote:
>There were many non-X patches (over half, I'm sure). The bulk seemed
>to fix sau7/DN4500 problems.

He's right. The patch tapes are cumulative, of course. I feel unnaturally possessive about the November tape because I prepared the only two NEW patches on it.

Sorry for the wrong info.

/Ollie Jones.