[comp.sys.amiga.tech] HD validation error

rnelson@eecs.wsu.edu (Roger Nelson - Grad Student) (01/14/91)

------

Considering the number of people that have been complaining about the
header corruption/validation error, it would appear that there is a real
problem with the File System.

I usually get the error whenever two applications access the HD at the
same time.  For example, I always get this error when I unarchive two
files simultaneously with lharc to the HD.

I now do everything in ram and copy to the HD; this is unacceptable.
I got the Amiga for its multitasking abilities which I need for my
research.  My research project requires several processes writing to
the hard drive.

Has anyone else noticed a pattern of validation errors other than those
caused by programs terminating or crashing without closing files?

Did the old file system have this problem, and is it fixed in 2.0?

Without anything else to go by, I would venture to guess that under
certain circumstances two different files are allocated the same block,
as if the task that allocates the file block is switched out before it can
finish marking the block as allocated.
I would be especially suspicious if blocks are allocated sequentially.
That is, if two tasks are writing to two files simultaneously, would the
allocated blocks be interleaved something like:

-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--
file1|file1|file2|file1|file1|file2|file1|file2|file1|file2|file2|file2|
-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--

If this is the case, wouldn't it make more sense to start block allocation
in a sector where there currently isn't any disk activity when the file is
opened?  This would probably also help alleviate fragmentation.
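
The race suspected above (a task being switched out between finding a
free block and marking it allocated) can be modelled in a few lines.
This is only a toy sketch in Python, not anything taken from the real
file system: the bitmap layout, the function names, and the round-robin
"scheduler" are all assumptions made for illustration.  Each task is a
generator, and every yield marks a point where a task switch could occur.

```python
# Toy model of the suspected race; NOT the real file system allocator.
# Each "task" is a generator that yields wherever a task switch could happen.

def find_free(bitmap):
    """Return the index of the first free block (False means free)."""
    return bitmap.index(False)

def racy_alloc(bitmap, owner, result):
    block = find_free(bitmap)   # step 1: pick a free block
    yield                       # a task switch can happen here
    bitmap[block] = True        # step 2: mark it allocated
    result[owner] = block

def atomic_alloc(bitmap, owner, result):
    block = find_free(bitmap)   # steps 1 and 2 with no switch between them
    bitmap[block] = True
    result[owner] = block
    yield                       # switching afterwards is harmless

def run_interleaved(alloc):
    """Run two allocating tasks round-robin, one step each per turn."""
    bitmap = [True, True, False, False, False]   # blocks 0-1 already in use
    result = {}
    tasks = [alloc(bitmap, "file1", result), alloc(bitmap, "file2", result)]
    while tasks:
        for task in list(tasks):
            try:
                next(task)
            except StopIteration:
                tasks.remove(task)
    return result

racy = run_interleaved(racy_alloc)
safe = run_interleaved(atomic_alloc)
print("racy:", racy)   # both files end up owning the same block
print("safe:", safe)   # each file gets its own block
```

In the racy version both tasks see the same block as free before either
marks it, so two files claim one block.  Making the find-and-mark step
indivisible removes the problem in this model.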

_____________________________________________________________________
      ______________
____  | ^          |    Roger Nelson          rnelson@yoda.eecs.wsu.edu
\^^ |*| ^          |    Agricultural Engineering Department     ///
 |^^//  ^^         |    Computer Science Department            ///
 |  '  ^          +|    Washington State University        \\\///
 \_  ^    _________|    Pullman, WA 99164                   \XX/
   `-----'

jesup@cbmvax.commodore.com (Randell Jesup) (01/15/91)

In article <1991Jan13.213610.16726@eecs.wsu.edu> rnelson@yoda.UUCP (Roger Nelson - Grad Student) writes:
>I usually get the error whenever two applications access the HD at the
>same time.  For example, I always get this error when I unarchive two
>files simultaneously with lharc to the HD.

	Are you certain lharc doesn't conflict with itself?  A number of
ported utilities are known to assume that only one copy of their program is
running (or running in a given directory).  Someone here has indicated to
me that lharc probably has this problem.

	The FS is quite reliable against large numbers of simultaneous
operations.  Unlike OFS, even re-use of the blocks of deleted files in a
directory while it's currently exnexting over them will not hurt FFS (the
classic design problem with ExNext).

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com  BIX: rjesup  
The compiler runs
Like a swift-flowing river
I wait in silence.  (From "The Zen of Programming")  ;-)

rnelson@eecs.wsu.edu (Roger Nelson - Grad Student) (01/16/91)

------

>	Are you certain lharc doesn't conflict with itself?  

I ask then, at what level does lharc "conflict with itself"?
In what way could lharc "conflict with itself"?

Apparently there is a conflict over some resource.  Since lharc has
never crashed while running multiple lharc processes to the ram disk, I
would have to conclude that the resource in question is the HD.

Is it not the responsibility of the File System to manage the HD resource?
If the File System is reliable against large numbers of simultaneous
operations, then there must be some flaw in the HD, the controller, or the
HD driver software?

Since it appears that others have had problems with multitasking lharc and
other utilities on HDs, we can probably (although not necessarily) rule out 
the HD itself and probably the controller as well.  This would leave the
device driver or bring us back to the file system.

> A number of ported utilities are known to assume that only one copy of their 
> program is running (or running in a given directory).  

Why shouldn't a program assume it is the only copy of itself running?
Why should this be any different than running several different programs?
Why should this cause problems, and why only with the HD?  What could
lharc be doing to cause the machine to crash in this manner?

By "running in a given directory" does this mean writing to a given
directory?  I assume that lharc is crashing during some sort of write
operation, since it would seem unlikely that there is a problem with
reads (I usually unarc from the HD to MEM).

Given the nature of what lharc does, I don't see that it would be doing
anything more complex than opening a file and writing to it a character at
a time (or, most likely, a buffer at a time).  Surely the Amiga can handle
multiple processes doing that kind of operation.  It apparently can, since I
haven't experienced this problem with zoo.

I suppose I can resign myself to the idea that the problem rests totally
with lharc, but this leaves the question: what is lharc doing
that should be avoided?

The answer should be nothing if lharc is making standard system calls
correctly.

Has anyone experienced lharc crashing the machine whilst multiple lharc
processes are unarchiving to devices other than the HD or to other
file systems?


limonce@pilot.njin.net (Tom Limoncelli) (01/17/91)

In article <1991Jan16.040022.22135@eecs.wsu.edu> rnelson@eecs.wsu.edu (Roger Nelson - Grad Student) writes:

> Is it not the responsibility of the File System to manage the HD resource?
> If the File System is reliable against large numbers of simultaneous
> operations, then there must be some flaw in the HD, the controller, or the
> HD driver software?

I don't know if this is what happens with LHARC, but it is a simple
counter example to what you suggest:

The FS can only protect you to a certain (excuse the pun) extent.
Suppose a utility needs to create a temporary file, and it creates it as
T:lharc.tmp every time.  Running several copies of the utility at once
will cause problems, because they all use the same file.  This is the kind
of thing I would expect from a program ported from the PC.

I didn't like the temporary-name routine I found in Manx 3.6, so I wrote
my own.  I believe AmigaDOS 2.0 has a "create unique file name" system call.
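
As a rough illustration of the unique-name idea, here is a minimal
sketch using Python's tempfile.mkstemp.  It is not the AmigaDOS call
mentioned above and not the Manx routine; the "lharc" prefix and the
scratch directory standing in for T: are assumptions for the example.

```python
import os
import tempfile

def open_private_temp(directory):
    """Create a temp file whose name is unique to this run; return (fd, path)."""
    # mkstemp creates the file itself and never reuses an existing name.
    return tempfile.mkstemp(prefix="lharc", suffix=".tmp", dir=directory)

with tempfile.TemporaryDirectory() as t_dir:   # stands in for T:
    fd_a, path_a = open_private_temp(t_dir)     # first "copy" of the utility
    fd_b, path_b = open_private_temp(t_dir)     # second copy, same directory
    os.write(fd_a, b"archive one")
    os.write(fd_b, b"archive two")
    os.close(fd_a)
    os.close(fd_b)
    with open(path_a, "rb") as f:
        data_a = f.read()
    with open(path_b, "rb") as f:
        data_b = f.read()
    names_differ = path_a != path_b

print(names_differ, data_a, data_b)
```

Because each run gets its own file, neither copy can overwrite the
other's scratch data, no matter how many are running in the same
directory.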
-- 
tlimonce@drew.edu     Tom Limoncelli      "Flash!  Flash!  I love you!
tlimonce@drew.bitnet  +1 201 408 5389        ...but we only have fourteen
tlimonce@drew.uucp    limonce@pilot.njin.net       hours to save the earth!"

jesup@cbmvax.commodore.com (Randell Jesup) (01/17/91)

In article <1991Jan17.032930.18196@eecs.wsu.edu> rnelson@yoda.UUCP (Roger Nelson - Grad Student) writes:
>I suppose I have said enough based on a lot of speculation.
>I should try some experiments to see if multiple processes attempting to
>write to a single file cause the read/write error and ultimately the
>validation error. (or is this a known problem?)

	Please do try some things (make sure you have a good backup or two
first!).  If you hit problems, send a report with as much info as possible
(ks/wb version, hardware config, lharc version, what you were doing, etc).

>However, suppose it is the case that this situation does indeed cause
>the read write error.  I contend that the file system should recover
>more gracefully than crashing the machine and requiring one to reformat
>the HD partition.

	Merely colliding writing to a file won't cause problems, but if
lharc didn't consider the possibility of its temp file being corrupted, it
might (note: might) die, trashing the filesystem's bitmap or other important
things.  I don't know that this is the case, or it could be some other
sort of bug in lharc.  I will probably do some tests here as well.
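
To make the temp-file scenario concrete, here is a small Python sketch
of two runs sharing one fixed temp name, with a checksum so that a run
can at least notice its file was replaced instead of trusting it.
Nothing here is known about what lharc actually does; the file name,
the helper functions, and the use of CRC-32 are all assumptions for
illustration.

```python
import os
import tempfile
import zlib

def write_temp(path, data):
    """Write data to the shared temp file; return a CRC to check later."""
    with open(path, "wb") as f:
        f.write(data)
    return zlib.crc32(data)

def read_temp_checked(path, expected_crc):
    """Read the temp file back; return None if it is no longer what we wrote."""
    with open(path, "rb") as f:
        data = f.read()
    if zlib.crc32(data) != expected_crc:
        return None
    return data

with tempfile.TemporaryDirectory() as t_dir:
    shared = os.path.join(t_dir, "lharc.tmp")        # same fixed name every run
    crc_a = write_temp(shared, b"data from run A")
    crc_b = write_temp(shared, b"data from run B")   # second run clobbers it
    seen_by_a = read_temp_checked(shared, crc_a)     # run A detects the change
    seen_by_b = read_temp_checked(shared, crc_b)     # run B sees its own data

print(seen_by_a, seen_by_b)
```

The write collision itself succeeds without any error from the file
system; the damage only appears when run A reads back data it did not
write.  A program that checks can fail cleanly instead of acting on
garbage.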


jesup@cbmvax.commodore.com (Randell Jesup) (01/17/91)

In article <1991Jan17.033827.18355@eecs.wsu.edu> rnelson@yoda.UUCP (Roger Nelson - Grad Student) writes:
>Granted a program should use a unique names generator, but is it
>unreasonable not to do so even in a multitasking environment?

	Yes, it's unreasonable, since your temp file might get corrupted
or deleted out from under your program.  It's one of the things one must
do to work properly in a shared environment (even single-tasking if you
have a network filesystem).

>If opening multiple same name files on the HD causes such serious
>problems, and is not an uncommon faux pas, shouldn't the file system simply 
>return a read/write error to the offending program and not allow the HD to 
>be corrupted?  Also, what is 'exnexting' and 'using memory after free'?
>Could someone provide me with a description?

	ExNext is the basic call used to examine a directory.  The full
explanation would get rather complex if you're not used to programming Amigas.

>> It may be lharc, or it may be something else (a 3rd-party utility
>> you have running, though this is quite unlikely).  
>
>I do not preclude the possibility that the problem is only manifest by
>my system configuration (I may very well have an old release of 1.3 since
>I bought my machine used and it is a rev. 4.2, but I also find that unlikely).

	If you submit a bug report, include the version numbers (use the
version menu entry from WB, or "version" from a shell).

>This brings me back to my original posting:  because the machine crashed,
>I decided to repartition the HD.  Since I bought the machine used, I didn't
>get the documentation on the 2090a controller, and the Amiga Owner's
>manual (Introducing....) only mentions the 2090 controller; thus, I have
>no information on making an autobooting HD partition.  How do you go about
>making such a HD partition bootable?

	Simply prep it as you would a 2090.  Then reboot from a floppy WB,
and format the dh<whatever>: partition.  Then copy workbench to it (note
that it will be old filesystem, and slower than FFS).  Then reboot without
a floppy and it should boot off the HD.

	Your machine would originally have come with an install disk to
set the disk up again.


gregg@cbnewsc.att.com (gregg.g.wonderly) (01/18/91)

From article <1991Jan13.213610.16726@eecs.wsu.edu>, by rnelson@eecs.wsu.edu (Roger Nelson - Grad Student):
> ------
> 
> Considering the number of people that have been complaining about the
> header corruption/validation error, it would appear that there is a real
> problem with the File System.

I used to have continual problems and frustration with having to back up
and restore my development partition.  Now, I too have moved to a
smaller partition which only requires a minimum of effort to back up and
restore.  There was a comment from one of the CA people that the
"Read/Write error" requester is only generated on a media error.  When
I diskdoctor such a partition, I get "multiple files reference block",
but after restoring (I backed up before diskdoctoring with a file by
file type backup program), not one single file is corrupted or
missing.  The data checks out flawlessly when I just read off of that
partition using my media checking program.  My personal opinion is that
the problem is in the FFS handler!
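
A read-everything check of the kind described can be sketched roughly
as follows.  This is not the media checking program mentioned above,
only a minimal Python illustration; the function name, chunk size, and
the scratch directory used for the self-check are assumptions.

```python
import os
import tempfile

def verify_readable(root, chunk_size=64 * 1024):
    """Read every file under root; return (files_read, list of (path, error))."""
    files_read = 0
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while f.read(chunk_size):
                        pass      # discard the data; only errors matter here
                files_read += 1
            except OSError as err:
                failures.append((path, str(err)))
    return files_read, failures

# Small self-check on a scratch tree.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "src"))
    for rel in ("a.txt", os.path.join("src", "b.c")):
        with open(os.path.join(root, rel), "wb") as f:
            f.write(b"x" * 1000)
    count, failures = verify_readable(root)

print(count, failures)
```

A clean pass from a check like this only shows that the data can be
read without media errors.  It says nothing about whether the file
system's own bookkeeping (such as which file owns which block) is
consistent.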

-- 
-----
gregg.g.wonderly@att.com   (AT&T bell laboratories)