res@cbnews.ATT.COM (Robert E. Stampfli) (02/14/89)
I have noticed that occasionally I will get a core dump immediately after invoking a program on the Unix-PC, usually upon trying to run it after it has just finished compiling. Simply reinvoking the program without any changes results in it running fine. This cannot be attributed to a hardware glitch, as I have seen it on at least three machines I have access to. Each has different disks, memory configuration, etc. They were, however, all running version 3.5 of the Unix-PC operating system.

Let me stress that this is a rare occurrence -- I have only seen it about six times, and I do a lot of compiling and testing on the Unix-PC. Examination of the cores shows nothing useful -- in fact, it is sometimes hard to explain how the code could possibly have faulted given the registers taken from the core file.

I am curious: has anyone else experienced this? If so, what version of Unix were you running? How often does it bite you? I have a few pet hypotheses but really no way to verify them. Anyway, I am mainly curious if others have observed this, too.

Rob Stampfli
att!cbnews!res (work)
osu-cis!n8emr!kd8wk!res (home)
jcm@mtunb.ATT.COM (was-John McMillan) (02/14/89)
In article <3986@cbnews.ATT.COM> res@cbnews.ATT.COM (Robert E. Stampfli) writes:
>I have noticed that occasionally I will get a core dump immediately after
>invoking a program on the Unix-PC, usually upon trying to run it after
>it has just finished compiling. Simply reinvoking the program without
>any changes results in it running fine. This can not be attributed to a
>hardware glitch, as I have seen this on at least three machines I have access
>to. Each has different disks, memory configuration, etc. They were,
>however, all running version 3.5 of the Unix-PC operating system.
                               ^^^
O sigh. Amongst the wonders of the world are the incredibly perverse ways in which Virtual Memory has been created. {Expletives deleted}

The 3B1 paging (VM) software was developed, by Convergent Technologies, from Berkeley concepts -- or so the documentation suggests. If that was so, the eventual Berkeley VM rewrites were well justified. It is too awful in its spaghettiosity [;-)] to believe, comprehend, or meaningfully alter.

However: one defect was an erroneous line in the context-switcher -- supported by misleading commenting -- that produced errors when a shared-memory [SM] process swapped out and the new process had text overlapping the previous SM address space. This error was identified (what phun) and fixed (trivial) AFTER the 3.51 release.

Other than the above, it's probably YOUR own fault! %-) However, relatively little use is made of indirect jumps that occur in DATA (i.e., changeable) space, so I presume you've hit IT.

WORKAROUND: Move your shared memory to a high, declared address -- where it cannot reasonably collide with TEXT or stack.

I'm still NOT satisfied with the 3.51c kernel, and the broader 3.51C system-wide fix-disk seems to be held off until other anomalies are fixed.

jc mcmillan -- att!mtunb!jcm -- o my, what pretty flames those are!
ditto@cbmvax.UUCP (Michael "Ford" Ditto) (02/15/89)
In article <3986@cbnews.ATT.COM> res@cbnews.ATT.COM (Robert E. Stampfli) writes:
>I have noticed that occasionally I will get a core dump immediately after
>invoking a program on the Unix-PC, usually upon trying to run it after
>it has just finished compiling.

I have seen this happen on a few different types of machines running System V Release 2. I think it is a bug related to exec'ing a file which has pending delayed-write blocks in the cache. I have only seen it happen when the program is invoked immediately after it is written, as in the shell command "make foo && foo".

I have seen it happen on the Unix PC when compiling gcc, because of the way it compiles a program like "genoutput" and then runs it to generate some more code.
-- 
-=] Ford [=-			"The number of Unix installations
(In Real Life: Mike Ditto)	 has grown to 10, with more expected."
ford@kenobi.cts.com		- The Unix Programmer's Manual,
...!sdcsvax!crash!elgar!ford	  2nd Edition, June, 1972.
ditto@cbmvax.commodore.com
lenny@icus.islp.ny.us (Lenny Tropiano) (02/15/89)
In article <3986@cbnews.ATT.COM> res@cbnews.ATT.COM (Robert E. Stampfli) writes:
|>I have noticed that occasionally I will get a core dump immediately after
|>invoking a program on the Unix-PC, usually upon trying to run it after
|>it has just finished compiling. Simply reinvoking the program without
|>any changes results in it running fine. This can not be attributed to a
|>hardware glitch, as I have seen this on at least three machines I have access
|>to. Each has different disks, memory configuration, etc. They were,
|>however, all running version 3.5 of the Unix-PC operating system.
|> ...

I have seen and experienced the same occurrence. I'm pretty sure it's not OS-release dependent; I've seen this on 3B2's as well. Usually it happens after you execute something on a system that was just written out (your example of compiling an executable and then executing it). It usually ends up with an "Illegal Instruction or Memory Fault" and dumps core.

This happens on occasion because "writes" on UNIX machines are delayed. The kernel could write directly to the disk for all filesystem accesses, but the system response time and throughput would be unacceptable. Therefore, to increase performance, disk access is done through internal data buffers called the "buffer cache". The kernel will minimize the frequency of writes to the disk if it determines that the data is transient and will soon be overwritten anyhow. Basically, the kernel hopes that someone will change the contents of the internal buffer (i.e., the data to be written to the disk) before it is actually written.

The disadvantage of delayed writes is that the user issuing the "write" system call is never sure when the data is finally written to the disk/media. Issuing a "sync" system call or using the "sync" command will assure that all dirty buffers are written to disk.
|>Let me stress that this is a rare occurrence -- I have only seen it about
|>six times, and I do a lot of compiling and testing on the Unix-PC.
|> ...

It's very rare; I would have trouble reproducing the error. I had to stick "sync" system calls into my B-tree routines at work at strategic points to assure that the data is written out to the disk, hopefully without destroying the performance of the B-tree routines.

|>I am curious: has anyone else experienced this? If so, what version of
|>Unix were you running? How often does it bite you? I have a few pet
|>hypotheses but really no way to verify them. Anyway, I am mainly curious if
|>others have observed this, too.
|> ...

You are not alone -- I've experienced it! It's happened on 3B1's running 3.0, 3.5, and 3.51[ac]. I've also experienced it with the 3B2 running 3.1, but there I was able to tweak the NAUTOUP tunable parameter to decrease the interval after which the system automatically updates the dirty buffers.

Hope this cleared up your "weird" happenings ...

-Lenny
-- 
Lenny Tropiano            ICUS Software Systems         [w] +1 (516) 582-5525
lenny@icus.islp.ny.us     Telex; 154232428 ICUS         [h] +1 (516) 968-8576
{talcott,decuac,boulder,hombre,pacbell,sbcs}!icus!lenny   attmail!icus!lenny
ICUS Software Systems -- PO Box 1; Islip Terrace, NY 11752
bes@holin.ATT.COM (Bradley Smith) (02/15/89)
In article <5977@cbmvax.UUCP> ditto@cbmvax.UUCP (Michael "Ford" Ditto) writes:
>I have seen it happen on the Unix PC when compiling gcc, because of the
>way it compiles a program like "genoutput" and then runs it to generate
>some more code.

I get this with gcc quite a bit. If I use the -v flag (verbose), I don't see it happening... any thoughts?
-- 
Bradley Smith
Computer Systems Offer Integration Laboratory
AT&T Bell Labs, Holmdel, NJ  201-949-0090
att!holin!bes or bes@holin.ATT.COM
alex@umbc3.UMBC.EDU (Alex S. Crain) (02/16/89)
In article <5977@cbmvax.UUCP> ditto@cbmvax.UUCP (Michael "Ford" Ditto) writes:
>I have seen it happen on the Unix PC when compiling gcc, because of the
>way it compiles a program like "genoutput" and then runs it to generate
>some more code.

I had attributed this to a bug in make: make is exec'ing genoutput with too many open file descriptors, and genoutput is unable to cope. Usually when this happens, genoutput spits a bunch of stuff out to stderr before it dies, the output being what should have gone to stdout.
-- 
					:alex
Alex Crain
Systems Programmer			alex@umbc3.umbc.edu
Univ Md Baltimore County		nerwin!alex@umbc3.umbc.edu (NEW DOMAIN)
jcm@mtunb.ATT.COM (was-John McMillan) (02/16/89)
In article <610@icus.islp.ny.us> lenny@icus.islp.ny.us (Lenny Tropiano) writes:
> ...
>The disadvantage of delayed writes is that the user issuing the "write" system
>call is never sure when the data is finally written to the disk/media.
>Issuing a "sync" system call or using the "sync" command will assure that
>all dirty buffers are written to disk.

1) The previous posting I made was correct regarding the existence of a window of vulnerability after the context-switch-out of a shared-memory process. It was incorrect when it suggested the principal OTHER alternative was pilot error.

2) Lenny is mostly correct in what he said, but his anthropomorphic, hope-filled kernel doesn't seem, to me, a very clear explanation of THE PROBLEM. And "disadvantages to the delayed write" is an oblique way of stating that if you abuse a concept, it fails.

The problem: TWO forms of file access are being used.

A) The LOADER [ld(1)] is using BIO [block I/O] to create the files. This involves acquiring the disk space from the free-list, recording this in the INODE for the file, and writing the file image to disk.

B) The EXEC [sys1.c:getxfile()] uses two approaches to acquiring the program image: 1) demand-loaded (PAGEIN) and 2) read-at-exec. The commonly used SHARED-TEXT images are demand-loaded. (Read-at-exec uses BIO, presumably doesn't fail, and is ignored below.)

A clash between BIO and PAGEIN occurs because the latter uses the former's BMAP [indices of file blocks] without using the former's cache-checking. Specifically: BIO accesses test whether a desired block is INCORE -- is in a cached buffer -- before touching the disk. PAGEIN accesses, however, presume the file is on disk -- saving the 4 cache searches / page [ 4 * 1KB Logical Block == 4KB Page ] which would encounter a HIT 'so rarely as to be ignorable'.

[ 'so rarely as to be ignorable' == The Reported #$@%^ problem. ]

This is, to me, an abuse of the BIO -- but with a reward/price ratio that is acceptable.
It should ONLY affect just-compiled code, so production software should be reliable UNLESS it does compile/exec's -- quite a rarity.

There are no inexpensive hooks permitting a test of WHEN the inode was written: if there were, that time could be compared to the ROOTDEV's last s_time, and a flush (bio.c:bflush()) and a sleep() -- or incore()'s -- might be used to synchronize the two mechanisms, or at least drop the incidence of problems another few orders of magnitude [base 10].

1) Inode 'times' are maintained on disk, but NOT in the inode table;
2) Inode 'times' are slightly unreliable given their undermining by TIME(1,2), CPIO/TAR, and dead batteries.

If I wasn't so strange to begin with, I'd think it was odd that I've never noticed this in 4 years of living with the s4/7300/3b1/#@$%!
- - - - - - - - - - - -
Perhaps the LD(1) software should have included a SYNC(2) before exiting. Note, however, SYNC-ing is a probabilistic work-around: SYNC(2) is not guaranteed to be COMPLETE -- just started -- upon return. (In that vein: perhaps LD DOES PERFORM A SYNC! -- but I doubt it.)
- - - - - - - - - - - -
>It's very rare. I would have trouble producing the error.

There are numerous reasons why attempts at FORCING this problem to occur may fail:
1) An SMGR sync(2) may have slipped through;
2) a flush may have been coerced by other buffer needs;
3) re-compiling and CP-ing into the target directory tends to use the same physical blocks as the previous image: depending on the location and amount of code changes, the paged-in blocks from the previous version may be the same as those of the about-to-be-written new version;
4) Bad karma, und so weiter...
- - - - - - - - - - - -
> I had to stick
>"sync" system calls into my B-tree routines at work at strategic points to
>assure that the data is written out to the disk, hopefully not destroying
>the performance of the B-tree routines.

Huh? If you are sticking with BIO, you do not need sync's for data accesses.
If you are sticking with RAW I/O, you do not need sync's for data accesses. If you are mixing the two, you need serious help -- much more than a SYNC will provide -- unless it's filled with interesting chemicals !^}
- - - - - - - - - - - -
>... but there I was able to tweak the NAUTOUP tunable parameter to decrease
>the interval after which the system automatically updates the dirty buffers.

Huh? Again, there seems to be an allusion to some problem other than the one I think we're discussing above. Tweaking NAUTOUP is a TUNING issue -- it is unlikely that anyone need touch this [not that I give a flying core dump if you DO]. Those dirty ol' buffers are just fine as they are. If there's some problem beyond JUST-COMPILED code, let's see it raised in detail: until then, it sounds like pilot error.
- - - - - - - - - - - -
Hmmm... the above has really been boring. Apologies.

jc mcmillan -- att!mtunb!jcm -- speaking for himself, only
-- save an electron this week: use 'r', not 'F', if possible.
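[Editor's note: the BIO-versus-PAGEIN clash jc describes can be illustrated with a toy model. All names and structures below are invented for illustration; this is not the real kernel source. A block written through the cache is invisible to a reader that goes straight to the "disk" until a flush happens:]

```c
#include <assert.h>
#include <string.h>

#define NBLK 8

static char disk[NBLK][16];   /* simulated on-disk blocks        */
static char cache[NBLK][16];  /* simulated buffer cache          */
static int  dirty[NBLK];      /* delayed-write flag per block    */

/* Delayed write, as ld(1) does through BIO: updates the cache only. */
void bio_write(int blk, const char *data)
{
    strcpy(cache[blk], data);
    dirty[blk] = 1;
}

/* BIO read: checks the cache first (the INCORE test), so it always
 * sees fresh data. */
const char *bio_read(int blk)
{
    return dirty[blk] ? cache[blk] : disk[blk];
}

/* Pager read: skips the cache search for speed and trusts the disk
 * image -- the stale read behind the mystery core dumps. */
const char *page_read(int blk)
{
    return disk[blk];
}

/* The bflush()-style remedy: write every dirty buffer to "disk". */
void flush_all(void)
{
    int i;
    for (i = 0; i < NBLK; i++)
        if (dirty[i]) {
            strcpy(disk[i], cache[i]);
            dirty[i] = 0;
        }
}
```

With this model, after bio_write(0, "new"), bio_read(0) returns the new contents but page_read(0) still returns the stale block until flush_all() runs -- exactly the window a freshly linked binary sits in between ld(1) exiting and the buffers being flushed.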
ignatz@chinet.chi.il.us (Dave Ihnat) (02/16/89)
In article <610@icus.islp.ny.us> lenny@icus.islp.ny.us (Lenny Tropiano) writes:
>The disadvantage of delayed writes is that the user issuing the "write" system
>call is never sure when the data is finally written to the disk/media.
>Issuing a "sync" system call or using the "sync" command will assure that
>all dirty buffers are written to disk.
> ...
>It's very rare. I would have trouble producing the error. I had to stick
>"sync" system calls into my B-tree routines at work at strategic points to
>assure that the data is written out to the disk, hopefully not destroying
>the performance of the B-tree routines.

Nothing major, but I do want to quell any hopes people may have that this is the answer to a reliable Unix database. Sync(2) doesn't even really guarantee that the flush is started by the time you get control back from the call -- only that it's been scheduled. (Thus the apocryphal "sync;sync;sync" before rebooting Unix systems.) This is why some vendors have added 'reliable writes' for database applications, guaranteeing that when you return the data is actually out on disk and not languishing in some dirty buffer.

This behavior, incidentally, was the "reason" that Unix would "never" make it in the business world that needs databases, since you could never guarantee the consistency of your view of the database with that actually recorded on disk. The success and continued expansion of such packages as Informix, Oracle, etc., of course, show that this wasn't nearly the huge problem that was anticipated.
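[Editor's note: the 'reliable writes' Dave mentions survive today as the POSIX O_SYNC open flag; whether any given vendor's port of that era had it varies, and the file name and helper below are invented for illustration.]

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: append one record with synchronized I/O.
 * With O_SYNC, each write() returns only after the data (and the
 * metadata needed to retrieve it) reaches stable storage, so the
 * record is never left languishing in a dirty buffer.
 * Returns 0 on success, -1 on failure. */
int synced_append(const char *path, const char *rec, int len)
{
    int ok;
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
    if (fd < 0)
        return -1;

    ok = (write(fd, rec, len) == len);
    close(fd);
    return ok ? 0 : -1;
}
```

The cost is one disk transaction per write() rather than per sync, which is why databases reserve synchronized I/O for their logs and control blocks rather than every access.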
jcm@mtunb.ATT.COM (was-John McMillan) (02/16/89)
Asbestos on: the following are opinions ...

SYNC is fine for its intended use: to force all deferred-write blocks out through a healthy I/O subsystem. In fact, it is far more reliable/reasonable than the netnews articles mentioning it often are. ;^)

In article <7724@chinet.chi.il.us> ignatz@chinet.chi.il.us (Dave Ihnat) writes:
>...
>Nothing major, but I do want to quell any hopes that people may have that this
>is the answer to a reliable Unix database.

Lacking any definition of "a reliable UNIX(rg) database" on your part, this is irrefutable. (I am NOT requesting such a definition -- just pointing out you've taken the easy road.)

> Sync(2) doesn't even really
>guarantee that the flush is started by the time you get control back from the
>call--only that it's been scheduled.

Perhaps a definition of "started" is in order. By the time the SYNC(2) call returns, either your call has queued all the writes to the disk, or another process is amidst doing this -- but is momentarily blocked by a resource lock of some sort. There is no "scheduled" process which has responsibility to complete YOUR requested sync at an indeterminate future time. The indeterminate feature is whether someone else is ALREADY doing it, or the length of time it will take to flush out the scheduled writes -- typically < NBUF * .02 sec.

> (Thus the apocryphal "sync;sync;sync"
>before rebooting Unix systems.)

As I was told, about 15 years ago, the reason for multiple SYNCs was: even YOU won't make a typographical error 3 times in a row. Made sense. Still does. WHAT does your "Thus" mean? I don't see any tie between the previous assertion and the parenthetic one. And the parenthetic one strikes me as irrelevant to the issue of databases.

(Anyone who types "sync;sync;sync;reboot" doesn't know what they're doing: the disk must be allowed to become quiescent before any reboot.
All dismountable file systems should be dismounted before reboots: this guarantees they are flushed, and that the databases are at least as consistent as their designers. 8-} )

> This is why some vendors have added 'reliable
>writes' for database applications, guaranteeing that when you return the data
>is actually out on disk and not languishing in some dirty buffer.

Sounds so seedy and degenerate, I kinda like it !-)

> This
>behavior, incidentally, was the "reason" that Unix would "never" make it in
>the business world that needs databases, since you could never guarantee the
>consistency of your view of the database with that actually recorded on disk.

The key words are DROPPED here: "... AFTER A SYSTEM CRASH". There is NO defect in the BIO subsystem -- at least none identified above -- in a running system. But... there are several features involving the sequencing of WRITES that can "encourage" inconsistent databases. For instance, the most heavily accessed blocks are the "last" to be flushed by buffer requests: this can mean your most central control blocks are the most vulnerable in a crash. In another direction, SYNCs typically write in a head-movement-optimized order [ELEVATOR algorithms] -- which does not represent any file-access sequencing and can effect inconsistencies.

>The success and continued expansion of such packages as Informix, Oracle, etc.,
>of course, show that this wasn't nearly the huge problem that was anticipated.

Suppose so. I DO wish the BIO and crash recovery would be extended to better support everyone, including DB requirements: in those Olden, Golden Daze of UNIX there were many crashes where you couldn't trust the consistency of ANYTHING -- including the cached buffers. Nowadays, crashes seem to involve exotic corners which usually have little bearing on the BIO -- "WOOPS, there went FOOPlan again", or "OK, 27000 Widgets DOES bring my Cray-PC down".
What would it take to record enough of the mount/BIO info to permit the flushing of those olde buffers during the primordeal [sick] BOOT phases?

Back into the cave...

jc mcmillan -- att!mtunb!jcm -- speaking only for himself, not THEM
sewilco@datapg.MN.ORG (Scot E Wilcoxon) (02/17/89)
In article <3986@cbnews.ATT.COM> res@cbnews.ATT.COM (Robert E. Stampfli) writes:
>I have noticed that occasionally I will get a core dump immediately after
>invoking a program on the Unix-PC, usually upon trying to run it after
>it has just finished compiling. Simply reinvoking the program without
>...
>however, all running version 3.5 of the Unix-PC operating system.

Under 3.5 I was having the same problem regularly. I was compiling under make(1), and during the compilations I'd type the name of the program. As soon as the make finished, the freshly compiled program would often core dump instead of executing. A `sleep 15` stopped the problem.
-- 
Scot E. Wilcoxon   sewilco@DataPg.MN.ORG   {amdahl|hpda}!bungia!datapg!sewilco
Data Progress      UNIX masts & rigging    +1 612-825-2607  uunet!datapg!sewilco
	I'm just reversing entropy while waiting for the Big Crunch.
ditto@cbmvax.UUCP (Michael "Ford" Ditto) (02/20/89)
In article <1667@umbc3.UMBC.EDU> alex@umbc3.umbc.edu.UMBC.EDU (Alex S. Crain) writes:
>In article <5977@cbmvax.UUCP> ditto@cbmvax.UUCP (Michael "Ford" Ditto) writes:
>>I have seen it happen on the Unix PC when compiling gcc, because of the
>>way it compiles a program like "genoutput" and then runs it to generate
>>some more code.
>
>	I had attributed this to a bug in make, in that make is execing
>genoutput with too many open file descriptors, and genoutput being unable
>to cope. Usually when this happens, genoutput spits a bunch of stuff out
>to stderr before it dies, the output being what should have gone to stdout.

I just did a simple test, and make doesn't seem to leave any file descriptors other than 0, 1, and 2 open in forked processes. It wasn't a very thorough test, but it did verify that 3 through 8 were all available with the simple Makefile I used.

I think a program with a page or two full of garbage might do just about anything before it died, so the 1>&2 phenomenon is somewhat believable.
-- 
-=] Ford [=-			"The number of Unix installations
(In Real Life: Mike Ditto)	 has grown to 10, with more expected."
ford@kenobi.cts.com		- The Unix Programmer's Manual,
...!sdcsvax!crash!kenobi!ford	  2nd Edition, June, 1972.
ditto@cbmvax.commodore.com