[comp.sys.next] kernel corruption on 330Mb hd???

2FHGKINGLY@kuhub.cc.ukans.edu (11/10/89)

Has anyone had any problems with your NeXT not booting up off the hard
drive?  What happens is the machine locks up during the boot process.
You fix the problem by booting off optical and copying the sdmach
file from optical to the hd.  This has happened to me once and others
three more times.  Any suggestions, comments?

Blake Hughes, undergrad, University of Kansas

eht@f.word.cs.cmu.edu (Eric Thayer) (11/10/89)

In article <17504@kuhub.cc.ukans.edu> 2FHGKINGLY@kuhub.cc.ukans.edu writes:
>Has anyone had any problems with your NeXT not booting up off the hard
>drive?  What happens is the machine locks up during the boot process.
>You fix the problem by booting off optical and copying the sdmach
>file from optical to the hd.  This has happened to me once and others
>three more times.  Any suggestions, comments?

If you get the "waiting for SCSI to become ready ..............." message, I've
seen this before a couple of times.

>
>Blake Hughes, undergrad, University of Kansas

-- 
Eric H. Thayer      School of Computer Science, Carnegie Mellon
(412) 268-7679      5000 Forbes Ave, Pittsburgh, PA 15213

rogerj@batcomputer.tn.cornell.edu (Roger Jagoda) (11/10/89)

In article <6906@pt.cs.cmu.edu> eht@f.word.cs.cmu.edu (Eric Thayer) writes:
>In article <17504@kuhub.cc.ukans.edu> 2FHGKINGLY@kuhub.cc.ukans.edu writes:
>>Has anyone had any problems with your NeXT not booting up off the hard
>>drive?  What happens is the machine locks up during the boot process.
>>You fix the problem by booting off optical and copying the sdmach
>>file from optical to the hd.  This has happened to me once and others
>>three more times.  Any suggestions, comments?
>
>If you get the "waiting for SCSI to become ready ..............." message, I've
>seen this before a couple of times.
>
>>
>>Blake Hughes, undergrad, University of Kansas
>
>-- 
>Eric H. Thayer      School of Computer Science, Carnegie Mellon
>(412) 268-7679      5000 Forbes Ave, Pittsburgh, PA 15213

 
Well, I've got another piece of bad news. I was talking to my
NeXT sales rep yesterday. He had just gotten his 40 MB accelerator
drives. Guess what, they ARE Quantum drives and he didn't know
which firmware these drives had. Sigh, I think Apple just stuck
it to their old buddy Steve. Maybe someone from NeXT could check with
Quantum to see whether they're ready for a WHOLE bunch of returns!
 
Roger Jagoda
System Support
Cornell University
FQOJ@CORNELLA.CIT.CORNELL.EDU
 

madler@tybalt.caltech.edu (Mark Adler) (11/11/89)

Yep, I've seen just that happen a few times to some NeXTs here.  It seems
to be contagious since it happens to a few machines connected over ethernet
at the same time (but not all of them?).  It's never happened (fingers crossed)
to my standalone NeXT (no net connection).  I have no suggestions.

Mark Adler

feldman@umd5.umd.edu (Mark Feldman) (11/11/89)

In article <17504@kuhub.cc.ukans.edu> 2FHGKINGLY@kuhub.cc.ukans.edu writes:
...
>What happens is the machine locks up during the boot process.
>You fix the problem by booting off optical and copying the sdmach
>file from optical to the hd. 
...
>Blake Hughes, undergrad, University of Kansas

I've had three systems corrupted the same way.  It's a bug.  Many others
have suffered the same bug.  As of yet, no one knows what is causing the
boot file to become corrupted, let alone how to prevent it.

	Mark

gerrit@nova.cc.purdue.edu (Gerrit) (11/11/89)

In article <17504@kuhub.cc.ukans.edu> 2FHGKINGLY@kuhub.cc.ukans.edu writes:
>Has anyone had any problems with your NeXT not booting up off the hard
>drive?  What happens is the machine locks up during the boot process.
>You fix the problem by booting off optical and copying the sdmach
>file from optical to the hd.  This has happened to me once and others
>three more times.  Any suggestions, comments?

This is a real problem and NeXT is aware of it.  The current theory
is that some user level process is opening /sdmach carelessly for
write and "accidentally" dropping garbage.  The symptoms seem to be
that a block of zeroes is written at the beginning of the data segment
and on the next reboot, the machine hangs after printing out the
memory configuration.
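
If you still have a corrupted copy around, one way to see where the damage
sits is to compare it byte by byte against a known-good kernel.  What follows
is only a minimal sketch, assuming a POSIX shell with cmp and awk.  The
function name corruption_extent is made up for illustration, and the two
arguments are whatever paths your good and suspect copies live at.

```shell
# corruption_extent GOOD SUSPECT
# Prints "first last count zeros": the 1-based offsets of the first and last
# differing bytes, how many bytes differ, and how many of those bytes are
# zero in SUSPECT.  Prints "identical" if nothing differs.
# (Hypothetical helper; not a NeXT-supplied tool.)
corruption_extent() {
    if [ ! -r "$1" ] || [ ! -r "$2" ]; then
        echo "cannot read input files" >&2
        return 2
    fi
    # cmp -l lists: byte-offset  octal-value-in-GOOD  octal-value-in-SUSPECT
    cmp -l "$1" "$2" 2>/dev/null |
    awk '{ if (NR == 1) first = $1; last = $1; if ($3 == 0) zeros++ }
         END { if (NR) print first, last, NR, zeros + 0; else print "identical" }'
}
```

A short first-to-last range in which nearly every differing byte is zero
would be consistent with the "block of zeroes" symptom described above.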

The fix listed above is currently the best workaround for the problem,
so keep a distribution OD within your reach for a while.

NeXT has a few sites running some tests hoping to isolate the faulty
program.  Once that is done, you should expect to see an updated version
of the faulty bugger, possibly in the archives on cc.purdue.edu, possibly
available via email, possibly available more directly from NeXT.
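
Until the culprit is isolated, a site could at least notice the corruption
before the next reboot by recording a checksum of the kernel while it is known
to be good and re-checking it from time to time.  This is a rough sketch, not
anything NeXT has described; it assumes the POSIX cksum utility is available
(older BSD-derived systems may only have sum, so substitute as needed).  The
function names and the baseline file are made up for illustration.

```shell
# record_baseline FILE BASELINE
# Save FILE's checksum and size in BASELINE (hypothetical helper).
record_baseline() {
    cksum < "$1" > "$2"
}

# check_baseline FILE BASELINE
# Print "unchanged" or "CHANGED"; return nonzero if FILE no longer matches.
check_baseline() {
    if [ "$(cksum < "$1")" = "$(cat "$2")" ]; then
        echo "unchanged"
    else
        echo "CHANGED"
        return 1
    fi
}
```

Run record_baseline once right after restoring a good kernel, then run
check_baseline with the same two arguments periodically or before shutting
down.  Noting what was running when it first reports CHANGED might help narrow
things down.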

gerrit

callisto@blake.acs.washington.edu (Finn) (11/12/89)

In article <12609@cit-vax.Caltech.Edu> madler@tybalt.caltech.edu.UUCP (Mark Adler) writes:
>
>Yep, I've seen just that happen a few times to some NeXTs here.  It seems
>to be contagious since it happens to a few machines connected over ethernet
>at the same time (but not all of them?).  It's never happened (fingers crossed)
>to my standalone NeXT (no net connection).  I have no suggestions.
>
>Mark Adler

 Dare I ask...
 Virus????

ali@polya.Stanford.EDU (Ali T. Ozer) (11/12/89)

In article <5604@umd5.umd.edu> feldman@umd5.umd.edu (Mark Feldman) writes:
>In article <17504@kuhub.cc.ukans.edu> 2FHGKINGLY@kuhub.cc.ukans.edu writes:
>>What happens is the machine locks up during the boot process.
>>You fix the problem by booting off optical and copying the sdmach
>>file from optical to the hd. 
>I've had three systems corrupted the same way.  It's a bug.  Many others
>have suffered the same bug.  As of yet, no one knows what is causing the
>boot file to become corrupted, let alone how to prevent it.

Yes, this is a bug.  NeXT is working on it.

If your system freezes up during the boot process, after announcing the
amount of memory and possibly the number of buffers used, then you might
be bitten by this bug.  You will need to boot from a 1.0 optical to fix
things; please diff your /sdmach file against the good one from the OD;
if they are different, copy the one from the OD over the corrupted one.
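
The compare-then-copy step might look like the sketch below.  It uses cmp
rather than diff, since the kernel is a binary file and only a same/different
answer is needed.  The function name is made up, and the paths are left as
parameters because where the hard disk's copy of sdmach appears once you have
booted from the OD is not spelled out here.

```shell
# restore_if_different GOOD TARGET
# Copy GOOD over TARGET only when the two files differ (hypothetical helper).
restore_if_different() {
    good=$1
    target=$2
    if cmp -s "$good" "$target"; then
        echo "identical"            # nothing to do
    elif cp "$good" "$target"; then
        echo "restored"             # TARGET now matches GOOD
    else
        echo "copy failed" >&2
        return 1
    fi
}
```

You will most likely need to be root for the copy to succeed, since the kernel
file is not writable by ordinary users.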

If you can duplicate the problem, please send me mail and I'll get it to
the OS engineers.

Ali

divine@gargoyle.uchicago.edu (Dwight Divine) (11/16/89)

	Please be patient with me, as this is my first post to a net.  I seem 
to have suffered this kernel corruption problem, with a nasty twist.  Not only
does the machine fail to boot from the hard drive (locking up shortly after
checking the RAM), but it will not boot from the 1.0 System floptical.  When I
use the Mach boot-from-floptical command, the disk begins to load in, but 
after setting up several of the daemons, I receive the message that the window
servers cannot be accessed/opened.  The boot-up locks at this point, but I
*am* able to power down using the power switch.  Still, I cannot successfully
boot up, and thus cannot fix the kernel corruption problem as per the suggested
fix posted in the various articles about this problem.
	This floptical problem could be due to a variety of things which have 
no relation to the original kernel problem, and user error has not as yet been
ruled out.  However, has anyone had this happen to them?  If so, I would 
greatly appreciate any input anyone has to offer.  
	Thanks much for the time.

						With Thanks,
						Dwight Divine
						(div3@tank)
						NeXT System Administrator
						Usite, U of Chicago

ali@polya.Stanford.EDU (Ali T. Ozer) (11/16/89)

In article <12811@polya.Stanford.EDU> Ali T. Ozer writes:
>In article <5604@umd5.umd.edu> feldman@umd5.umd.edu (Mark Feldman) writes:
>>I've had three systems corrupted the same way.  It's a bug.  Many others
>>have suffered the same bug.  As of yet, no one knows what is causing the
>>boot file to become corrupted, let alone how to prevent it.
>Yes, this is a bug.  NeXT is working on it.

The bug has been discovered and there is a workaround, in fact, an incredibly
simple one.  Launch a Shell, become root, and remove the executable bit on 
your kernel:

	su
	[type password]
	chmod a-x /sdmach

The problem occurs if you try to launch an executable in the Mach preload
format; depending on how the pages are laid out in the file, a part of the
file might become corrupted if paging occurs after the file is "launched."

Mach preload executables are meant to be bootable images and are not meant
to be executed by the demand-paged system; thus your system will not lose
any functionality when you remove the executable bit.  You will just be
assuring that the kernel is not launched inadvertently (either from the
Shell or with a double-click), which is probably what caused the
problem in all cases.  

There are only two preload format files in the system, the kernel and the boot 
file.  The boot file has been shipped without the executable bit so it's fine.
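
To confirm that the workaround took, you can look at the mode string before
and after.  A small sketch: show_mode and drop_exec are made-up names, and the
only real commands involved are ls, cut and chmod.

```shell
# show_mode FILE
# Print the 10-character type/permission string, e.g. -r-xr-xr-x
show_mode() {
    ls -ld "$1" | cut -c1-10
}

# drop_exec FILE
# Remove execute permission for everyone, then report the resulting mode.
drop_exec() {
    chmod a-x "$1" && show_mode "$1"
}
```

Run as root, drop_exec /sdmach on a kernel that started out readable and
executable by all should report -r--r--r--.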

Thanks to Alan Marcum and Avie for the explanation and workaround.

Ali

	

feldman@umd5.umd.edu (Mark Feldman) (11/17/89)

In article <12837@polya.Stanford.EDU> ali@Polya.Stanford.EDU (Ali T. Ozer) 
writes:
...
>The bug has been discovered and there is a workaround, in fact, an incredibly
>simple one.  Launch a Shell, become root, and remove the executable bit on 
>your kernel:
>
>	su
>	[type password]
>	chmod a-x /sdmach

Ok, I read it, I did it, but I'm not very happy about the implications.

>The problem occurs if you try to launch an executable in the Mach preload
>format; depending on how the pages are laid out in the file, a part of the
>file might become corrupted if paging occurs after the file is "launched."

The files /sdmach and /odmach (which are the same file) are owned by root
and their permissions are 555 -- readable and executable by all, writable by
none.  How is it that the file can be written to when it is executed by a
user other than root?

>Mach preload executables are meant to be bootable images and are not meant
>to be executed by the demand-paged system; thus your system will not lose
>any functionality when you remove the executable bit.  You will just be
>assuring that the kernel is not launched inadvertently (either from the
>Shell or with a double-click), which is probably what caused the
>problem in all cases.  

The fact that it is possible to write to a file when you don't have
permission is very bad.  Very, very bad.  And why would the system ever try
to page back to a program file?  Me thought that that is what a swap file
was for.

>There are only two preload format files in the system, the kernel and the boot 
>file.  The boot file has been shipped without the executable bit so it's fine.

Not fine.  Getting an error back when trying to execute one of these files
would be fine.  Getting a core dump would be ok.  Having the original, write
protected file corrupted is not.

>Thanks to Alan Marcum and Avie for the explanation and workaround.
>
>Ali
>

Ali, if the fix will keep my kernels from being corrupted, thanks!  If there's
one thing that I can't stand, it's a corrupted kernel.  But what am I
missing?

  Mark


p.s.

If someone has a NeXT and does not have USENET access, how will they find
out about the fix?

ali@polya.Stanford.EDU (Ali T. Ozer) (11/17/89)

In article <5631@umd5.umd.edu> feldman@umd5.umd.edu (Mark Feldman) writes:
>In article <12837@polya.Stanford.EDU> I wrote:
>>The bug has been discovered and there is a workaround ...
>>The problem occurs if you try to launch an executable in the Mach preload
>>format; depending on how the pages are laid out in the file, a part of the
>>file might become corrupted if paging occurs after the file is "launched."
>
>The files /sdmach and /odmach (which are the same file) are owned by root
>and their permissions are 555 -- readable and executable by all, writable by
>none.  How is it that the file can be written to when it is executed by a
>user other than root?

This is a bug, after all --- and bugs break rules.  

The bug will only occur if you try to execute a preload format file, and even 
then only under special circumstances, which the sdmach file exhibits. 
This bug will not occur when executing normal demand-paged executables or
trying to execute other non-executable files.  

Again --- this is not a file system bug but rather a bug in the program loader
trying to load a preload format file.  sdmach is the only file in the system
that will cause this bug to occur.

>If someone has a NeXT and does not have USENET access, how will they find
>out about the fix?

NeXT is getting the news out to customers through various other channels.

Ali

dcarpent@sjuphil.uucp (D. Carpenter) (11/17/89)

>>If someone has a NeXT and does not have USENET access, how will they find
>>out about the fix?
>
>NeXT is getting the news out to customers through various other channels.
>
>Ali

What other channels?  Does NeXT have any regular means for communicating
with its customers?  All I know is what I read in this newsgroup or in
the newspapers and trade press.  Being a NeXT owner without at the same
time being a NeXT support person can leave one feeling rather isolated.
-- 
===============================================================
David Carpenter            dcarpent@sjuphil.UUCP                    
St. Joseph's University    dcarpent%sjuphil.sju.edu@relay.cs.net    
Philadelphia, PA  19131    ST_JOSEPH@HVRFORD.BITNET                

jsb@panix.UUCP (J. S. B'ach) (11/17/89)

In article <12609@cit-vax.Caltech.Edu> madler@tybalt.caltech.edu.UUCP (Mark Adler) writes:
)
)Yep, I've seen just that happen a few times to some NeXTs here.  It seems
)to be contagious since it happens to a few machines connected over ethernet
)at the same time (but not all of them?).  It's never happened (fingers crossed)
)to my standalone NeXT (no net connection).  I have no suggestions.

Well it happened to mine.  And I was booting from the optical drive.  The fix was
to restore from the hard drive.  NeXT is aware of the bug and asked to be sent
copies of the corrupt kernel to help them debug.
-- 
	rutgers!cmcl2!panix!jsb  	 or (more reliably) uunet!actnyc!jsb
"There aren't enough men around.  Every time there's a plane accident, it's 100 men
 dead or it's a troop transport, and I literally think, 'Why couldn't some women
 have been on that flight?'" - Helen Gurley Brown