[unix-pc.general] unexplained cores after compiles

res@cbnews.ATT.COM (Robert E. Stampfli) (02/14/89)

I have noticed that occasionally I will get a core dump immediately after
invoking a program on the Unix-PC, usually upon trying to run it after
it has just finished compiling.  Simply reinvoking the program without
any changes results in it running fine.  This can not be attributed to a
hardware glitch, as I have seen this on at least three machines I have access
to.  Each has different disks, memory configuration, etc.  They were,
however, all running version 3.5 of the Unix-PC operating system. 

Let me stress that this is a rare occurrence -- I have only seen it about
six times, and I do a lot of compiling and testing on the Unix-PC.
Examination of the cores shows nothing useful -- in fact, it is sometimes hard
to explain how the code could possibly have faulted given the registers taken
from the core file.

I am curious: has anyone else experienced this?  If so, what version of
Unix were you running?  How often does it bite you?  I have a few pet
hypotheses but really no way to verify them.  Anyway, I am mainly curious if
others have observed this, too.

Rob Stampfli
att!cbnews!res (work)
osu-cis!n8emr!kd8wk!res (home)

jcm@mtunb.ATT.COM (was-John McMillan) (02/14/89)

In article <3986@cbnews.ATT.COM> res@cbnews.ATT.COM (Robert E. Stampfli) writes:
>I have noticed that occasionally I will get a core dump immediately after
>invoking a program on the Unix-PC, usually upon trying to run it after
>it has just finished compiling.  Simply reinvoking the program without
>any changes results in it running fine.  This can not be attributed to a
>hardware glitch, as I have seen this on at least three machines I have access
>to.  Each has different disks, memory configuration, etc.  They were,
>however, all running version 3.5 of the Unix-PC operating system. 
			      ^^^
O sigh.  Amongst the wonders of the world are the incredibly perverse
ways in which Virtual Memory has been created.  {Expletives deleted}

The 3B1 paging (VM) software was developed, by Convergent Technologies,
from Berkeley concepts -- or so the documentation suggests.  If that
was so, the eventual Berkeley VM re-writes were well justified.  It is
too awful in its spaghettiosity [;-)] to believe, comprehend, or
meaningfully alter.  However:

	One defect was an erroneous line in the context-switcher
	-- supported by misleading commenting -- that produced
	errors when a shared-memory [SM] process swapped out and
	the new process had text overlapping the previous SM
	address space.  This error was identified (what phun)
	and fixed (trivial) AFTER the 3.51 release.

Other than the above, it's probably YOUR own fault! %-)
However, relatively little use is made of indirect jumps that occur
in DATA (i.e., changeable) space, so I presume you've hit IT.

WORKAROUND:	Move your Shared Memory to a high, declared address
		-- where it cannot reasonably collide with TEXT or
		stack.  (A sketch follows.)
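
A minimal sketch of that workaround, assuming the stock System V
shmget(2)/shmat(2) interface.  The segment size and the attach address
0x300000 are illustrative guesses only, not values taken from this thread:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>

#define SHM_SIZE 4096                       /* illustrative segment size   */
#define SHM_ADDR ((void *)0x300000)         /* illustrative "high" address */

int main(void)
{
	int id;
	void *addr;

	if ((id = shmget(IPC_PRIVATE, SHM_SIZE, IPC_CREAT | 0600)) == -1) {
		perror("shmget");
		return 1;
	}

	/* Attach at a declared, fixed address instead of letting the
	 * kernel pick one that might land on top of a text segment. */
	addr = shmat(id, SHM_ADDR, 0);
	if (addr == (void *)-1) {
		perror("shmat");
		shmctl(id, IPC_RMID, 0);
		return 1;
	}
	printf("attached at %p\n", addr);

	shmdt(addr);
	shmctl(id, IPC_RMID, 0);            /* clean up the segment */
	return 0;
}

Whether 0x300000 is really out of harm's way depends on the process's own
text, data, and stack layout; the point is simply to declare the address
yourself rather than let the kernel choose one.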

I'm still NOT satisfied with the 3.51c kernel, and the broader
3.51C system-wide fix-disk seems to be held off until other
anomalies are fixed.

jc mcmillan	-- att!mtunb!jcm	-- o my, what pretty flames those are!

ditto@cbmvax.UUCP (Michael "Ford" Ditto) (02/15/89)

In article <3986@cbnews.ATT.COM> res@cbnews.ATT.COM (Robert E. Stampfli) writes:
>I have noticed that occasionally I will get a core dump immediately after
>invoking a program on the Unix-PC, usually upon trying to run it after
>it has just finished compiling.

I have seen this happen on a few different types of machines running
System V Release 2.  I think it is a bug related to execing a file which
has pending delayed-write blocks in the cache.  I have only seen it
happen when the program is invoked immediately after it is written, as
in the shell command "make foo && foo".
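
For concreteness, a sketch of one way to narrow that window: schedule the
dirty buffers for writing before the exec.  The program name "foo" and the
blanket sync(2) are assumptions for illustration, not something prescribed
in the thread:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	/* The loader writes "foo" through the buffer cache, so some of
	 * its blocks may still be delayed writes when cc returns. */
	if (system("cc -o foo foo.c") != 0) {
		fprintf(stderr, "compile failed\n");
		return 1;
	}

	/* Schedule all dirty buffers for writing before exec'ing the
	 * fresh image.  As noted later in the thread, sync() only
	 * starts the flush, so this narrows the window rather than
	 * closing it. */
	sync();

	execl("./foo", "foo", (char *)0);
	perror("execl");                    /* reached only if exec failed */
	return 1;
}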

I have seen it happen on the Unix PC when compiling gcc, because of the
way it compiles a program like "genoutput" and then runs it to generate
some more code.
-- 
					-=] Ford [=-

"The number of Unix installations	(In Real Life:  Mike Ditto)
has grown to 10, with more expected."	ford@kenobi.cts.com
- The Unix Programmer's Manual,		...!sdcsvax!crash!elgar!ford
  2nd Edition, June, 1972.		ditto@cbmvax.commodore.com

lenny@icus.islp.ny.us (Lenny Tropiano) (02/15/89)

In article <3986@cbnews.ATT.COM> res@cbnews.ATT.COM (Robert E. Stampfli) writes:
|>I have noticed that occasionally I will get a core dump immediately after
|>invoking a program on the Unix-PC, usually upon trying to run it after
|>it has just finished compiling.  Simply reinvoking the program without
|>any changes results in it running fine.  This can not be attributed to a
|>hardware glitch, as I have seen this on at least three machines I have access
|>to.  Each has different disks, memory configuration, etc.  They were,
|>however, all running version 3.5 of the Unix-PC operating system. 
|>
...
I have seen and experienced the same occurrence.  I'm pretty sure it's not
OS release dependent.  I've seen this on 3B2's as well.  Usually it happens
after you execute something that was just written out (your explanation of
compiling an executable and then executing it).  It usually ends up with an
"Illegal Instruction" or "Memory Fault" and dumps core.
This happens on occasion because "writes" on UNIX machines are delayed.  
The kernel could write directly to the disk for all filesystem accesses, but
the system response time and throughput would be unacceptable.  Therefore,
to increase performance, disk access is done through internal data buffers
called the "buffer cache".  The kernel will minimize the frequency of writes
to the disk if it determines that the data is transient and will soon
be overwritten anyhow.  Basically, the kernel hopes that someone will
change the contents of the internal buffer (i.e., the data to be written to
the disk) before it is actually written.

The disadvantage of delayed writes is that the user issuing the "write" system
call is never sure when the data is finally written to the disk/media.
Issuing a "sync" system call or using the "sync" command will assure that
all dirty buffers are written to disk.
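
A small sketch of the two calls involved -- write(2) hands the data to the
buffer cache and returns, and sync(2) later schedules the dirty buffers for
writing.  The file name and record are placeholders:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd;
	static const char rec[] = "some record\n";

	if ((fd = open("datafile", O_WRONLY | O_CREAT | O_TRUNC, 0644)) == -1) {
		perror("open");
		return 1;
	}

	/* write() normally returns as soon as the data is copied into a
	 * buffer-cache block; the block is marked delayed-write and goes
	 * to disk later, at the kernel's convenience. */
	if (write(fd, rec, sizeof rec - 1) == -1) {
		perror("write");
		return 1;
	}
	close(fd);

	/* sync() schedules every dirty buffer in the cache for writing;
	 * it does not wait for the I/O to complete. */
	sync();
	return 0;
}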

|>Let me stress that this is a rare occurrence -- I have only seen it about
|>six times, and I do a lot of compiling and testing on the Unix-PC.
...
It's very rare.  I would have trouble producing the error.  I had to stick
"sync" system calls into my B-tree routines at work at strategic points to
assure that the data is written out to the disk, hopefully not destroying
the performance of the B-tree routines.

|>I am curious: has anyone else experienced this?  If so, what version of
|>Unix were you running?  How often does it bite you?  I have a few pet
|>hypotheses but really no way to verify them.  Anyway, I am mainly curious if
|>others have observed this, too.
|>
...
You are not alone.  I've experienced it!  It's happened on 3B1's running
3.0, 3.5, and 3.51[ac].  I've also experienced it with the 3B2 running 3.1,
but there I was able to tweak the NAUTOUP tunable parameter to decrease
the interval at which the system will automatically update the dirty buffers.

Hope this cleared up your "weird" happenings ...
-Lenny
-- 
Lenny Tropiano             ICUS Software Systems         [w] +1 (516) 582-5525
lenny@icus.islp.ny.us      Telex; 154232428 ICUS         [h] +1 (516) 968-8576
{talcott,decuac,boulder,hombre,pacbell,sbcs}!icus!lenny  attmail!icus!lenny
        ICUS Software Systems -- PO Box 1; Islip Terrace, NY  11752

bes@holin.ATT.COM (Bradley Smith) (02/15/89)

In article <5977@cbmvax.UUCP> ditto@cbmvax.UUCP (Michael "Ford" Ditto) writes:
>I have seen it happen on the Unix PC when compiling gcc, because of the
>way it compiles a program like "genoutput" and then runs it to generate
>some more code.
I get this with gcc quite a bit; if I use the -v flag (verbose) I
don't see this happening ... any thoughts?
-- 
Bradley Smith
Computer Systems Offer Integration Laboratory
AT&T Bell Labs, Holmdel, NJ 
201-949-0090 att!holin!bes or bes@holin.ATT.COM

alex@umbc3.UMBC.EDU (Alex S. Crain) (02/16/89)

In article <5977@cbmvax.UUCP> ditto@cbmvax.UUCP (Michael "Ford" Ditto) writes:

>I have seen it happen on the Unix PC when compiling gcc, because of the
>way it compiles a program like "genoutput" and then runs it to generate
>some more code.

	I had attributed this to a bug in make, in that make is execing
genoutput with too many open file descriptors, and genoutput is unable
to cope. Usually when this happens, genoutput spits a bunch of stuff out
to stderr before it dies, the output being what should have gone to stdout.

-- 
					:alex
Alex Crain
Systems Programmer			alex@umbc3.umbc.edu
Univ Md Baltimore County		nerwin!alex@umbc3.umbc.edu (NEW DOMAIN)

jcm@mtunb.ATT.COM (was-John McMillan) (02/16/89)

In article <610@icus.islp.ny.us> lenny@icus.islp.ny.us (Lenny Tropiano) writes:
>	...
>The disadvantage of delayed writes is that the user issuing the "write" system
>call is never sure when the data is finally written to the disk/media.
>Issuing a "sync" system call or using the "sync" command will assure that
>all dirty buffers are written to disk.

1)	The previous posting I made was correct regarding the existence
	of a window of vulnerability after the context-switch-out of
	a shared-memory process.  It was incorrect when it suggested
	the principal OTHER alternative was pilot error.

2)	Lenny is mostly correct in what he said, but his anthropomorphic,
	hope-filled kernel doesn't seem, to me, a very clear explanation
	of THE PROBLEM.  And "the disadvantage of delayed writes" is
	an oblique way of stating that if you abuse a concept, it fails.
	
The problem: TWO forms of file-access are being used.

	A)	The LOADER [ld(1)] is using BIO [block I/O]
		to create the files.  This involves acquiring
		the disk space from the free-list, recording
		this in the INODE for the file, and writing the
		file image to disk.

	B)	The EXEC [sys1.c:getxfile()] uses two approaches
		to acquiring program image: 1) demand-loaded (PAGEIN)
		and 2) read-at-exec.  The commonly-used SHARED-TEXT
		images are demand-loaded.  (Read-at-exec uses
		BIO, presumably doesn't fail, and is ignored below.)

	A clash between BIO and PAGEIN occurs because the latter
	uses the former's BMAP [indices of file blocks] without
	using the former's cache-checking.

Specifically:
	BIO-accesses test if a desired block is INCORE -- is in
	a cached buffer -- before touching the disk.
	PAGEIN-accesses, however, presume the file is on disk
	-- saving the 4 cache searches / page
		[ 4 * 1KB Logical Block == 4KB Page ]
	which would encounter a HIT 'so rarely as to be ignorable'.

	[ 'so rarely as to be ignorable' == The Reported #$@%^ problem. ]

	This is, to me, an abuse of the BIO -- but with a reward/price
	ratio that is acceptable.  It should ONLY affect just-compiled
	codes, so production software should be reliable UNLESS it is
	doing compile/exec's -- quite a rarity.  (A toy model of the
	clash follows.)
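
The following is a toy, user-space model of that clash -- emphatically not
the 3B1 kernel source; every name in it is invented -- showing how a reader
that consults the cache sees the new data while one that goes straight to
disk sees the stale block:

/*
 * Toy, user-space model only -- NOT kernel code.  "disk" and "cache"
 * are plain arrays; the point is just the difference between a reader
 * that checks the cache and one that does not.
 */
#include <stdio.h>
#include <string.h>

#define NBLK  4
#define BSIZE 16

static char disk[NBLK][BSIZE];          /* pretend disk blocks     */
static char cache[NBLK][BSIZE];         /* pretend buffer cache    */
static int  dirty[NBLK];                /* delayed-write flags     */

static void bio_write(int blk, const char *data)
{
	strncpy(cache[blk], data, BSIZE - 1);   /* copy into the cache ...    */
	dirty[blk] = 1;                         /* ... and defer the disk I/O */
}

static const char *bio_read(int blk)    /* BIO path: INCORE check first */
{
	return dirty[blk] ? cache[blk] : disk[blk];
}

static const char *pagein_read(int blk) /* PAGEIN path: straight to "disk" */
{
	return disk[blk];
}

int main(void)
{
	strcpy(disk[0], "old image");       /* what is really on the disk */
	bio_write(0, "new image");          /* ld(1) just wrote the a.out */

	printf("BIO sees:    %s\n", bio_read(0));     /* "new image"         */
	printf("PAGEIN sees: %s\n", pagein_read(0));  /* "old image" -- core */
	return 0;
}

On the real system the delayed-write block does reach the disk eventually;
the toy just freezes the moment at which the pagein happens.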

There are no inexpensive hooks permitting a test of WHEN the inode
was written: if there were, that time could be compared to the ROOTDEV's
last s_time and a flush (bio.c:bflush()) and a sleep() -- or incore()'s
-- might be used to synchronize the two mechanisms, or at least drop the
incidence of problems another few orders of magnitude [base 10].
	1) Inode 'times' are maintained on disk, but NOT in the
		inode table;
	2) Inode 'times' are slightly unreliable given their undermining
		by TIME(1,2), CPIO/TAR, and dead batteries.
			
If I wasn't so strange to begin with, I'd think it was odd that
I've never noticed this in 4 years of living with the s4/7300/3b1/#@$%!
			- - - - - - - - - - - -
Perhaps the LD(1) software should have included a SYNC(2) before exiting.

Note, however, SYNC-ing is a probabilistic work-around:
	SYNC(2) is not guaranteed to be COMPLETE -- just started -- upon return.

(In that vein: perhaps LD DOES PERFORM A SYNC!   -- but I doubt it.)
			- - - - - - - - - - - -
>It's very rare.  I would have trouble producing the error.

There are numerous reasons why attempts at FORCING this problem
to occur may fail:
	1) An SMGR sync(2) may have slipped through;
	2) a flush may have been coerced by other buffer needs;
	3) re-compiling and CP-ing into the target directory tends
		to use the same physical-blocks as the previous
		image: depending on the location and amount of
		code changes, the Paged-In blocks from the previous
		version may be the same as those of the about-to-be-
		written new version.
	4) Bad karma, und so weiter...
			- - - - - - - - - - - -
>							  I had to stick
>"sync" system calls into my B-tree routines at work at strategic points to
>assure that the data is written out to the disk, hopefully not destroying
>the performance of the B-tree routines.

Huh?
If you are sticking with BIO, you do not need sync's for data accesses.

If you are sticking with RAW IO, you do not need sync's for data accesses.

If you are mixing the two, you need serious help -- much more than
a SYNC will provide -- unless it's filled with interesting chemicals !^}
			- - - - - - - - - - - -
>... but there I was able to tweak the NAUTOUP tunable parameter to decrease
>the interval at which the system will automatically update the dirty buffers.

Huh?
Again, there seems to be an allusion to some problem other than the one I
think we're discussing above.  Increasing NAUTOUP is a TUNING issue
-- it is unlikely that anyone need touch this [not that I give a flying
core dump if you DO].  Those dirty ol' buffers are just fine as they are.

If there's some problem beyond JUST-COMPILED codes, let's see it raised
in detail: until then, it sounds like pilot error.
			- - - - - - - - - - - -

Hmmm... the above has really been boring.  Apologies.

jc mcmillan	-- att!mtunb!jcm	-- speaking for himself, only
		-- save an electron this week: use 'r', not 'F', if possible.

ignatz@chinet.chi.il.us (Dave Ihnat) (02/16/89)

In article <610@icus.islp.ny.us> lenny@icus.islp.ny.us (Lenny Tropiano) writes:
>The disadvantage of delayed writes is that the user issuing the "write" system
>call is never sure when the data is finally written to the disk/media.
>Issuing a "sync" system call or using the "sync" command will assure that
>all dirty buffers are written to disk.
>	...
>It's very rare.  I would have trouble producing the error.  I had to stick
>"sync" system calls into my B-tree routines at work at strategic points to
>assure that the data is written out to the disk, hopefully not destroying
>the performance of the B-tree routines.

Nothing major, but I do want to quell any hopes that people may have that this
is the answer to a reliable Unix database.  Sync(2) doesn't even really
guarantee that the flush is started by the time you get control back from the
call--only that it's been scheduled.  (Thus the apocryphal "sync;sync;sync"
before rebooting Unix systems.)  This is why some vendors have added 'reliable
writes' for database applications, guaranteeing that when you return the data
is actually out on disk and not languishing in some dirty buffer.  This
behavior, incidentally, was the "reason" that Unix would "never" make it in
the business world that needs databases, since you could never guarantee the
consistency of your view of the database with that actually recorded on disk.
The success and continued expansion of such packages as Informix, Oracle, etc.,
of course, show that this wasn't nearly the huge problem that was anticipated.
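
As an illustration of what such a 'reliable write' looks like to the
application, here is a sketch using the O_SYNC open flag; O_SYNC comes from
later System V and POSIX systems and is assumed here for illustration -- it
was not necessarily available on the machines in this thread:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd;
	static const char rec[] = "committed record\n";

	/* With O_SYNC, write() does not return until the data (and the
	 * file-system information needed to retrieve it) is on the disk,
	 * rather than languishing in a dirty buffer. */
	fd = open("journal", O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
	if (fd == -1) {
		perror("open");
		return 1;
	}
	if (write(fd, rec, sizeof rec - 1) == -1) {
		perror("write");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}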

jcm@mtunb.ATT.COM (was-John McMillan) (02/16/89)

Asbestos on:  The following are opinions ...

SYNC is fine for its intended use: to force all Deferred-Write blocks out
	through a healthy I/O subsystem.

In fact, it is far more reliable/reasonable than the netnews articles
mentioning it often are. ;^)

In article <7724@chinet.chi.il.us> ignatz@chinet.chi.il.us (Dave Ihnat) writes:
>...
>Nothing major, but I do want to quell any hopes that people may have that this
>is the answer to a reliable Unix database.

	Lacking any definition of "a reliable UNIX(rg) database" on
	your part, this is irrefutable.  (I am NOT requesting such a
	definition -- just pointing out you've taken the easy road.)

>						 Sync(2) doesn't even really
>guarantee that the flush is started by the time you get control back from the
>call--only that it's been scheduled.

	Perhaps a definition of "started" was in order.  By the time the
	SYNC(2) call returns, either your call has queued all the writes
	to the disk, or another process is in the midst of doing this -- but is
	momentarily blocked by a resource lock of some sort.

	There is no "scheduled" process which has responsibility to
	complete YOUR requested sync at an indeterminate future time.
	The indeterminate feature is whether someone else is ALREADY
	doing it, or the length of time it will take to flush out the
	scheduled writes -- typically < NBUF * .02sec.

>					 (Thus the apocryphal "sync;sync;sync"
>before rebooting Unix systems.)

	As I was told, about 15 years ago, the reason for multiple
	SYNCs was: even YOU won't make a typographical error
	3 times in a row.  Made sense.  Still does.  WHAT does your
	"Thus" mean?  I don't see any tie between the previous assertion
	and the parenthetic one.  And the parenthetic one strikes me as
	irrelevant to the issue of databases.

	(Anyone who types "sync;sync;sync;reboot" doesn't know what
	they're doing: The disk must be allowed to become quiescent
	before any reboot.  All dismountable file systems should
	be dismounted before reboots: this guarantees they are flushed,
	and that the databases are at least as consistent as their
	designers. 8-} )
	
>				 This is why some vendors have added 'reliable
>writes' for database applications, guaranteeing that when you return the data
>is actually out on disk and not languishing in some dirty buffer.

	Sounds so seedy and degenerate, I kinda like it !-)

>									 This
>behavior, incidentally, was the "reason" that Unix would "never" make it in
>the business world that needs databases, since you could never guarantee the
>consistency of your view of the database with that actually recorded on disk.

	The key words are DROPPED here: "... AFTER A SYSTEM CRASH".

	There is NO defect in the BIO subsystem -- at least none
	identified above -- in a running system.  But... there are
	several features involving the sequencing of WRITES that
	can "encourage" inconsistent databases.  For instance, the
	most heavily accessed blocks are the "last" to be flushed
	by buffer requests: this can mean your most central control
	blocks are the most vulnerable in a crash.  In another
	direction, SYNC's typically write in a head-movement
	optimized order [ELEVATOR algorithms] -- which does not
	represent any file-access sequencing and can effect
	inconsistencies.
	
>The success and continued expansion of such packages as Informix, Oracle, etc.,
>of course, show that this wasn't nearly the huge problem that was anticipated.

	Suppose so.  I DO wish the BIO and crash-recovery would be
	extended to better support everyone, including DB requirements: 
	In those Olden, Golden Daze of UNIX there were many crashes
	where you couldn't trust the consistency of ANYTHING --
	including the cached buffers.  Nowadays, crashes seem to involve
	exotic corners which usually have little bearing on the BIO
	-- "WOOPS, there went FOOPlan again", or "OK, 27000 Widgets DOES
	bring my Cray-PC down".  What would it take to record enough
	of the mount/BIO info to permit the flushing of those olde
	buffers during the primordeal [sick] BOOT phases?

Back into the cave...

jc mcmillan	-- att!mtunb!jcm	-- speaking only for himself, not THEM

sewilco@datapg.MN.ORG (Scot E Wilcoxon) (02/17/89)

In article <3986@cbnews.ATT.COM> res@cbnews.ATT.COM (Robert E. Stampfli) writes:
>I have noticed that occasionally I will get a core dump immediately after
>invoking a program on the Unix-PC, usually upon trying to run it after
>it has just finished compiling.  Simply reinvoking the program without
>...
>however, all running version 3.5 of the Unix-PC operating system. 

Under 3.5 I was having the same problem regularly.  I was compiling under
make(1) and during the compilations I'd type the name of the program.  As
soon as the make finished, the freshly compiled program would often
core dump instead of executing.  A `sleep 15` stopped the problem.
-- 
Scot E. Wilcoxon  sewilco@DataPg.MN.ORG    {amdahl|hpda}!bungia!datapg!sewilco
Data Progress 	 UNIX masts & rigging  +1 612-825-2607    uunet!datapg!sewilco
	I'm just reversing entropy while waiting for the Big Crunch.

ditto@cbmvax.UUCP (Michael "Ford" Ditto) (02/20/89)

In article <1667@umbc3.UMBC.EDU> alex@umbc3.umbc.edu.UMBC.EDU (Alex S. Crain) writes:
>In article <5977@cbmvax.UUCP> ditto@cbmvax.UUCP (Michael "Ford" Ditto) writes:
>>I have seen it happen on the Unix PC when compiling gcc, because of the
>>way it compiles a program like "genoutput" and then runs it to generate
>>some more code.
>
>	I had attributed this to a bug in make, in that make is execing
>genoutput with too many open file descriptors, and genoutput being unable
>to cope. Usually when this happens, genoutput spits a bunch of stuff out
>to stderr before it dies, the output being what should have gone to stdout.

I just did a simple test, and make doesn't seem to leave any file
descriptors other than 0,1,2 open in forked processes.  It wasn't a very
thorough test, but it did verify that 3,4,5,6,7,8 were all available with
the simple Makefile I used.
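
For anyone who wants to repeat the test, here is a sketch of one way to do
it from a Makefile rule, using only fcntl(2) with F_GETFD (which fails with
EBADF on a closed descriptor); the limit of 20 descriptors is an arbitrary
assumption:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	int fd;

	/* Report every descriptor below 20 that is currently open. */
	for (fd = 0; fd < 20; fd++) {
		if (fcntl(fd, F_GETFD) != -1)
			printf("fd %d is open\n", fd);
		else if (errno != EBADF)
			printf("fd %d: unexpected error\n", fd);
	}
	return 0;
}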

I think a program with a page or two full of garbage might do just about
anything before it died, so the 1>&2 phenomenon is somewhat believable.
-- 
					-=] Ford [=-

"The number of Unix installations	(In Real Life:  Mike Ditto)
has grown to 10, with more expected."	ford@kenobi.cts.com
- The Unix Programmer's Manual,		...!sdcsvax!crash!kenobi!ford
  2nd Edition, June, 1972.		ditto@cbmvax.commodore.com