[comp.unix.i386] Disks Hang Under 2.0.2 SCSI

steve@cdp.UUCP (12/12/89)

SUMMARY -- README
-------
We have been experiencing regular crashes running under
Interactive 2.0.2 with 3 SCSI disks on an aha1542a.  Later in
this message is a script which crashes our machine.  The
purpose of this message is to find other people who are
willing to try to replicate these crashes on various
machines.  I encourage folks to try out the script, even if
they do not have our exact hardware configuration.  This will
help us to better understand whether the problem lies in
hardware or in 2.0.2.  

DETAILS
-------
The symptom of the crashes is that all processes continue to
run, but any process that goes for the disk hangs.  So, getty
prints the login prompt, and accepts a name at login:, but
when it goes to spawn login, the exec hangs the system.
Switch to a different virtual console, and repeat the same
thing.  emacs works fine until it tries to auto-save, open
a file, etc...
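
A minimal sketch of how this symptom could be detected from a script,
assuming a POSIX shell: start a one-block read in the background and
poll it with kill -0.  The function name probe_disk and the time limit
are hypothetical and not part of the original report.

```shell
#!/bin/sh
# probe_disk DEVICE LIMIT
# Hypothetical helper: read one 512-byte block from DEVICE in the
# background and poll it once a second.  Prints "OK" if the read
# finishes, or "HUNG" (exit status 1) if it is still pending after
# LIMIT seconds.
probe_disk() {
    dev=$1
    limit=$2
    dd if="$dev" of=/dev/null bs=512 count=1 2>/dev/null &
    probe_pid=$!
    waited=0
    while kill -0 "$probe_pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            echo "HUNG: $dev"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "OK: $dev"
    return 0
}

# Example (device name taken from the script in this post):
# probe_disk /dev/dsk/1s3 5
```

A reader stuck in a hung disk request may not respond to signals, so
the sketch only reports the hang and leaves the reader running.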

The crashes occur intermittently -- about once a day on our
machine that averages 20 users at a time and is up 24 hours a
day.

On an alternate machine, I have developed a little program
that can reliably crash the machine with 3 or more disks within
about 2 seconds of invocation.  The program looks like this :

-----------------------------------------------------------
:
# crashix.sh
#
#   hangs  i/386 2.0.2 SCSI driver within 2 seconds.
#   Assume root partition on /dev/dsk/0s1, other
#   partitions on /dev/dsk/1s3 and  /dev/dsk/2s3.
#
sync
sync
dd if=/dev/dsk/0s1 bs=128k >/dev/null &
dd if=/dev/dsk/1s3 bs=128k >/dev/null &
dd if=/dev/dsk/2s3 bs=128k >/dev/null &
----------------------------------------------------------

Notes :
   o It is important to have all of the dd's in the background,
     so that they all have the same priority.  The occurrence
     of crashes is related to disk use intensity; having processes
     with lower priority reduces the disk use intensity and crash
     frequency.

   o You may have different partitioning on your disks.  Change
     the device names to suit your configuration.  Try it with
     2, 3, or 4 disks.

   o Our hardware configuration is :
       - Mylex MI-386/20 motherboard, 8MB RAM on board.

       - aha1542a SCSI host adapter
          4 CDC 94161 150MB SCSI disks
          1 Archive 2150s "VIPER" cartridge drive

       - a hercules card.

   o I have tried crashix.sh with three other motherboards
     (two based on recent Chips & Technologies chip sets),
     and the behaviour is the same (crash within 2 seconds).
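
The suggestion in the notes to change the device names and try 2, 3,
or 4 disks can be sketched as a function that takes the devices as
arguments, assuming a POSIX shell.  The name read_all is hypothetical;
the block size matches crashix.sh.

```shell
#!/bin/sh
# read_all DEVICE...
# Hypothetical helper: start one background dd per device so that all
# readers run concurrently, as in crashix.sh, then wait for them.
read_all() {
    n=0
    for dev in "$@"; do
        dd if="$dev" of=/dev/null bs=128k 2>/dev/null &
        n=$((n + 1))
    done
    wait    # on an affected system this may never return
    echo "finished $n readers"
}

# Example (adjust device names to your partitioning):
# read_all /dev/dsk/0s1 /dev/dsk/1s3 /dev/dsk/2s3
```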


Support at Interactive's OEM division has returned calls, but is
skeptical that this is software-related.

Thanks for your help.  If people that test their machines
keep me informed, I will post the results to the net.  


Steve Fram

Chief Programmer
Community Data Processing (CdP)
{hplabs, pyramid, ...}!cdp!steve
(415)-322-9069

jgd@rsiatl.UUCP (John G. De Armond) (12/13/89)

In article <654400003@cdp> steve@cdp.UUCP writes:
>
>
>SUMMARY -- README
>-------
>We have been experiencing regular crashes running under
>Interactive 2.0.2 with 3 SCSI disks on an aha1542a.  Later in
>this message is a script which crashes our machine.  The
>purpose of this message is to find other people who are
>willing to try to replicate these crashes on various
>machines.  I encourage folks to try out the script, even if
>they do not have our exact hardware configuration.  This will
>help us to better understand whether the problem lies in
>hardware or in 2.0.2.  
>
>DETAILS
>-------
>The symptom of the crashes is that all processes continue to
>run, but any process that goes for the disk hangs.  So, getty
>prints the login prompt, and accepts a name at login:, but
>when it goes to spawn login, the exec hangs the system.
>Switch to a different virtual console, and repeat the same
>thing.  emacs works fine until it tries to auto-save, open
>a file, etc...


Steve, 

We have had the same failure here under similar conditions.  Configuration
here is an Adaptec host adaptor and 2 380 MB Newbury data drives.

Our problem seemed to manifest itself mostly under pathological conditions,
such as when a bad block is discovered.  I've also seen it when I've been
running a script similar to yours designed to hammer a new hard disk before
putting it into service.

The external symptoms are as you note PLUS I notice that the activity LED
on the Adaptec board is stuck on AND the activity LED on one of the drives
is on continuously.

We now have a bit more data in that it occurs on two totally different
drive types.

Without any investigation other than external observation, I suspect that the 
problem has to do with either a buffer getting overrun or a problem with
a task releasing the scsi bus to another one.

The fact that the problem only occurs either when 2 drives are heavily loaded
or when an error condition happens - which appears from the LED activity 
to tie the bus up for a spell - should be a major clue.  I absolutely
cannot cause this failure by any combination of loading on one drive.
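
To check the observation that load on a single drive never triggers
the hang, a control run can read the same devices one at a time
instead of concurrently.  A minimal sketch, assuming a POSIX shell;
read_serial is a hypothetical name.

```shell
#!/bin/sh
# read_serial DEVICE...
# Hypothetical control run: read each device in the foreground, one
# after another, so that only one drive is busy at any time.
read_serial() {
    for dev in "$@"; do
        if dd if="$dev" of=/dev/null bs=128k 2>/dev/null; then
            echo "done: $dev"
        else
            echo "failed: $dev"
            return 1
        fi
    done
}

# Example (substitute your own device names):
# read_serial /dev/dsk/0s1 /dev/dsk/1s3
```

If the serial run completes while a concurrent run over the same
devices hangs, that would be consistent with the problem lying in how
multiple devices share the bus rather than in any single drive.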

John

-- 
John De Armond, WD4OQC                     | The Fano Factor - 
Radiation Systems, Inc.     Atlanta, GA    | Where Theory meets Reality.
emory!rsiatl!jgd          **I am the NRA** | 

steve@cdp.UUCP (12/14/89)

This is a followup on a posting I made a couple days ago, about
being able to easily crash interactive 2.0.2, by reading from
3 SCSI disks simultaneously (running an aha1542a controller).

I have replicated the crash on a compaq 386/20e.  This was
sufficient for interactive to "validate" the bug report --
i.e., they consider it a driver bug.  There is no commitment
on their part to fix it, but the L.A. support person continues
to be communicative and sympathetic.  Hollis doesn't return
my phone calls.

We have now come up with a program that will crash 2.0.2 with
just 2 SCSI disks and 1 SCSI tape.  I suspect that this is a
more standard configuration.  The tape drive is an archive
2150s (with ROM revisions as recommended in the interactive
1.0.6 release notes).  The disks are CDC 94161.  Alas, this
program takes between 10 minutes and 1 hour to hang the disk
driver (as opposed to the 3 disk version, which hung the driver
in < 2 seconds).

Does anyone have experience running the future domain
controller with 3 or more SCSI disks ?  We are considering
replacing our SCSI disks with 2 high capacity ESDI disks (> 600
MB), running the Western Digital wd1007v-se2 (replacement for
WD1007).  Does anyone have experience (good or bad) with such a
configuration ?


Steve Fram
Chief Programmer
Community Data Processing (CdP)
{pyramid, hplabs, ...}!cdp!steve

-------------------------------- cut here --------------------------------
:
# crashix2.sh
#
#   crash interactive 2.0.2 running just 2 SCSI disks and 1
#   SCSI tape.
#

# disk parameters
dd_dev1=/dev/rdsk/0s1			# root
dd_dev2=/dev/rdsk/1s3			# one partition on disk
dd_count=1000
dd_bs=128k

# tape parameters
tape_dev=/dev/ct
tape_bs=32k

while :
   do	echo "New dd loop..."

	dd if=$dd_dev1 of=/dev/null bs=$dd_bs count=$dd_count 2>/dev/null &
	dd_p1=$!
	dd if=$dd_dev2 of=/dev/null bs=$dd_bs count=$dd_count 2>/dev/null &
	dd_p2=$!

	# poll until both background dd's have exited
	while kill -0 $dd_p1 || kill -0 $dd_p2
	   do	sleep 10
	done 2>/dev/null

done &

while :
   do	echo "New tape loop..."
	dd if=$tape_dev of=/dev/null bs=$tape_bs
done &
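
Since this version can take from 10 minutes to an hour to hang, it may
help to know how far the disk loop got.  A minimal sketch, assuming a
POSIX shell, that waits on each dd by PID instead of polling and prints
a pass counter.  The name disk_passes and its arguments are
hypothetical and not part of crashix2.sh.

```shell
#!/bin/sh
# disk_passes MAX DEV1 DEV2
# Hypothetical variant of the disk loop: run up to MAX passes, each
# reading both devices concurrently, and report each completed pass.
disk_passes() {
    max=$1
    dev1=$2
    dev2=$3
    pass=0
    while [ "$pass" -lt "$max" ]; do
        dd if="$dev1" of=/dev/null bs=128k count=1000 2>/dev/null &
        p1=$!
        dd if="$dev2" of=/dev/null bs=128k count=1000 2>/dev/null &
        p2=$!
        wait "$p1" "$p2"    # blocks until both readers have exited
        pass=$((pass + 1))
        echo "pass $pass done"
    done
}

# Example, using the raw devices from crashix2.sh:
# disk_passes 100 /dev/rdsk/0s1 /dev/rdsk/1s3
```

The counter goes to the terminal rather than to a file, since a write
to disk would presumably hang along with everything else.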

kmoore@shiloh.UUCP (kirk moore) (12/15/89)

I have been running a WD7000fasst card with a 280 meg Newbury (SCSI) drive.

No problems at all. 


-- 
Kirk Moore --- Bellevue, WA ---
uunet!pilchuck!dataio!-------\
uw-beaver!uw-entropy!dataio!-----shiloh!kmoore
shiloh  --- Bellevue, WA --- (206) 562-1561(board) - (206) 747-5709(voice)

neese@adaptex.UUCP (12/15/89)

>We have had the same failure here under similar conditions.  Configuration
>here is an Adaptec host adaptor and 2 380 MB Newbury data drives.
>
>Our problem seemed to manifest itself mostly under pathological conditions,
>such as when a bad block is discovered.  I've also seen it when I've been
>running a script similar to yours designed to hammer a new hard disk before
>putting it into service.
>
>The external symptoms are as you note PLUS I notice that the activity LED
>on the Adaptec board is stuck on AND the activity LED on one of the drives
>is on continuously.
>
>We now have a bit more data in that it occurs on two totally different
>drive types.
>
>Without any investigation other than external observation, I suspect that the 
>problem has to do with either a buffer getting overrun or a problem with
>a task releasing the scsi bus to another one.
>
>The fact that the problem only occurs either when 2 drives are heavily loaded
>or when an error condition happens - which appears from the LED activity 
>to tie the bus up for a spell - should be a major clue.  I absolutely
>cannot cause this failure by any combination of loading on one drive.

The problem, in this instance anyway, is the Newbury drives.  Newbury
SCSI hard drives do not correctly support SCSI bus arbitration.  That
is what causes the hang condition when there is more than one drive in the
system.  I had another customer that had the same problem and Newbury
finally admitted the problem.

			Roy Neese
			Adaptec Central Field Applications Engineer
			UUCP @ {texbell,attctc}!cpe!adaptex!neese
				merch!adaptex!neese
				uunet!swbatl!texbell!merch!adaptex!neese

neese@adaptex.UUCP (12/15/89)

>This is a followup on a posting I made a couple days ago, about
>being able to easily crash interactive 2.0.2, by reading from
>3 SCSI disks simultaneously (running an aha1542a controller).
>
>I have replicated the crash on a compaq 386/20e.  This was
>sufficient for interactive to "validate" the bug report --
>i.e., they consider it a driver bug.  There is no commitment
>on their part to fix it, but the L.A. support person continues
>to be communicative and sympathetic.  Hollis doesn't return
>my phone calls.
>
>We have now come up with a program that will crash 2.0.2 with
>just 2 SCSI disks and 1 SCSI tape.  I suspect that this is a
>more standard configuration.  The tape drive is an archive
>2150s (with ROM revisions as recommended in the interactive
>1.0.6 release notes).  The disks are CDC 94161.  Alas, this
>program takes between 10 minutes and 1 hour to hang the disk
>driver (as opposed to the 3 disk version, which hung the driver
>in < 2 seconds).
>
>Does anyone have experience running the future domain
>controller with 3 or more SCSI disks ?  We are considering
>replacing our SCSI disks with 2 high capacity ESDI disks (> 600
>MB), running the Western Digital wd1007v-se2 (replacement for
>WD1007).  Does anyone have experience (good or bad) with such a
>configuration ?

Just to alleviate any concerns.  The 154x host adapters support up to
the maximum number of SCSI devices you can have (7 targets * 8 LUNs).
This has been verified in many ways.  I have had as many as 6 hard drives
on my SCO 2.3GT system, and ran the test you supplied with no problems.
I let it run for 2 days.  I expanded it to hit all 6 drives and still
no problems.  Good test though.

			Roy Neese
			Adaptec Central Field Applications Engineer
			UUCP @ {texbell,attctc}!cpe!adaptex!neese
				merch!adaptex!neese
				uunet!swbatl!texbell!merch!adaptex!neese

steve@corpane.UUCP (Steve Snow) (12/15/89)

>In article <654400003@cdp> steve@cdp.UUCP writes:
>>
>>
>>SUMMARY -- README
>>-------
>>We have been experiencing regular crashes running under
>>Interactive 2.0.2 with 3 SCSI disks on an aha1542a.  Later in
>>this message is a script which crashes our machine.  The
>>-------

>Steve, 

>We have had the same failure here under similar conditions.  Configuration
>here is an Adaptec host adaptor and 2 380 MB Newbury data drives.

We are running ISC 2.0.2 here on an Acer System 15 with two 600meg Micropolis
drives. We have, however, replaced the ISC SCSI driver with the Chantel SCSI
driver. I ran the test script for well over 6 minutes with no problems at all.
It sounds to me like your problem is in the ISC driver. I highly recommend the
Chantel driver since it supports 8mm tape drives and optical disks, which we
use for backups. The driver has been very solid and installation was easy.

Steve Snow
-- 
Steve Snow| Corpane Industries  | DISK Inc.                | DISK   300-1200bd
          | 10100 Bluegrass Pkwy| 5716 Outer Loop          |   (502)968-5401
          | Louisville, KY 40299| Louisville, KY 40219     |        thru
          | ..ukma!corpane!steve| ..ukma!corpane!disk!steve|   (502)968-5406

neese@adaptex.UUCP (12/17/89)

>I have been running a WD7000fasst card with a 280 meg Newbury (SCSI) drive.
>
>No problems at all. 

You won't have any problems until you add another SCSI drive.  It doesn't
matter whose drive you add, if there is a Newbury drive in the system,
it will have problems, regardless of whose controller (adapter) you use,
unless the adapter is only capable of doing single-threaded I/O.


			Roy Neese
			Adaptec Central Field Applications Engineer
			UUCP @ {texbell,attctc}!cpe!adaptex!neese
				merch!adaptex!neese
				uunet!swbatl!texbell!merch!adaptex!neese

kmoore@shiloh.UUCP (kirk moore) (12/20/89)

Try running the WD7000fasst card with the custom drivers from Columbia
Data Products. If you are interested, I will repost the number and
address for CDP...

-- 
Kirk Moore --- Bellevue, WA ---
uunet!pilchuck!dataio!-------\
uw-beaver!uw-entropy!dataio!-----shiloh!kmoore
shiloh  --- Bellevue, WA --- (206) 562-1561(board) - (206) 747-5709(voice)