[comp.std.unix] Standards Update, ANSI X3B11.1: WORM File Systems

jsh@usenix.org (Jeffrey S. Haemer) (09/19/90)

Submitted-by: jsh@usenix.org (Jeffrey S. Haemer)

           An Update on UNIX*-Related Standards Activities

                            September 1990

                 USENIX Standards Watchdog Committee

          Jeffrey S. Haemer <jsh@usenix.org>, Report Editor

ANSI X3B11.1: WORM File Systems

Andrew Hume <andrew@research.att.com> reports on the July 17-19, 1990
meeting in Murray Hill, NJ:

Introduction

X3B11.1 is working on a standard for file interchange on write-once
media (both sequential and non-sequential (random access)): a portable
file system for WORMs.  The fifth meeting was held at Murray Hill, NJ
on July 17-19, 1990.  We adopted a working paper and set to work on a
list of issues suggested by the chair.

Data Compression

Despite the huge capacities of WORM disks, people always want more.
Data compression is an easy way to supply more and, on current machine
architectures, can probably speed data access by trading CPU cycles
for I/O bandwidth.  Its main problem is that you need to support more
than one algorithm, and thus you need some way to specify algorithms.
This is a purely administrative issue, but luckily, it appears that X3
may soon act as a registry for compression algorithms (driven by the
need to register compression algorithms for IBM 3480 cartridge tape
work in X3B5).  (How does this fit in with the rumblings about
compress from POSIX.2?  I'm not certain.  I think part of becoming
part of the registry means giving up patent rights or allowing liberal
licensing, but maybe not.  After all, the CD formats are now an ISO
standard, but I still think you have to be licensed to make them.)

Path Tables and Extended Attributes

Path tables were removed from the working paper.  We agreed to support
hard and symbolic links.  The next question was how to handle
``secret'' files: files primarily intended for system use.  Examples
might include the file describing free space, associated files (like
the resource fork of a Macintosh file), and extended attributes (of a
Microsoft HPFS file).  We agreed that the latter two cases should be
handled by regular files that probably are not in the directory tree
but are pointed to by the ``inode'' for a file.  (Note that this
implies there is a way to scan all the files in a volume set without
traversing the directory tree(s), analogous to running down the inodes
in UNIX.)

__________

  * UNIX is a Registered Trademark of UNIX System Laboratories in
    the United States and other countries.

Given this, we have decided to support extended attributes as a
``secret'' or system file (and probably include pointers to things
like resource forks as those attributes).  This also gives us an
extensible way of handling non-standard or non-essential inode fields.
One of the important tasks remaining is to decide which fields are
more-or-less mandatory (such as modify time, owner) and which can
safely be pushed off into the extended attributes (access control
lists, file valid after date).  Please send us your suggestions!

Space Allocation and Management

We agreed that we have to support preallocating space for files,
freeing some or all of that space and then reusing that space for
other files.  After much discussion about extent lists and bit maps,
we compromised on a scheme based on extent lists (the details to be
worked out by the working paper editor).  The idea is that the
free space is described by an extent list (of small but specifiable
size) of the ``best'' (probably largest) free spaces, and if this
overflows, ``worst'' free spaces are added to a system file
representing all the free spaces not in the above extent list.
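
A minimal Python sketch of this free-space scheme follows.  The class
and method names, the choice of ``largest'' as ``best'', and the plain
list standing in for the overflow system file are all assumptions made
for illustration; the real details are left to the working paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Extent:
    start: int   # first logical block of the free run
    length: int  # number of blocks in the run


@dataclass
class FreeSpace:
    capacity: int                                 # size of the short "best" list
    best: list = field(default_factory=list)      # kept largest-first
    overflow: list = field(default_factory=list)  # stands in for the system file

    def release(self, extent: Extent) -> None:
        """Record a freed extent, spilling the smallest one on overflow."""
        self.best.append(extent)
        self.best.sort(key=lambda e: e.length, reverse=True)
        while len(self.best) > self.capacity:
            self.overflow.append(self.best.pop())  # "worst" goes to the file

    def allocate(self, length: int):
        """Carve `length` blocks from the largest listed extent, or return None.

        A fuller version would also consult the overflow file.
        """
        if not self.best or self.best[0].length < length:
            return None
        big = self.best.pop(0)
        rest = big.length - length
        if rest:
            self.release(Extent(big.start + length, rest))
        return Extent(big.start, length)
```

With a capacity of two, releasing extents of 10, 50, and 5 blocks
leaves the 50- and 10-block extents in the short list and pushes the
5-block extent to the overflow file.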

Checksums

It was decided that all system data structures would include a 16 bit
checksum (CRC-16).  We anticipate that most errors would be transient
(cabling or memory) and not be media errors.
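
As a rough illustration, here is a Python sketch of sealing and
checking a structure with a 16-bit CRC.  The report says only
``CRC-16''; the polynomial (0x1021), the zero initial value, and the
two-byte LSB-first trailer are assumptions, not something the
committee has specified here.

```python
def crc16(data: bytes, poly: int = 0x1021, init: int = 0x0000) -> int:
    """Bitwise CRC-16, most significant bit first (assumed parameters)."""
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def seal(payload: bytes) -> bytes:
    """Append the checksum to a structure before writing it."""
    return payload + crc16(payload).to_bytes(2, "little")


def verify(record: bytes) -> bool:
    """Recompute the checksum of a structure read back from disk."""
    payload, stored = record[:-2], int.from_bytes(record[-2:], "little")
    return crc16(payload) == stored
```

A transient cabling or memory error that flips a bit makes verify()
fail, at which point the read can simply be retried.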

Multi-Volume Sets

I had thought the last meeting had settled just about all the
questions about multi-volume sets; I was wrong.  It took most of a day
to agree on these.

   - You have to have the last volume in order to grok the whole
     volume set (access any/all of the directories and files).

   - You can extend volume sets at any time.  This and the last item
     taken together imply the existence of ``terminal'' volumes (which
     can act as master volumes of a volume set) and ``nonterminal''
     volumes (the rest).  For example, if I extend a single-volume
     volume set by two volumes, then volumes 1 and 3 are terminal and
     volume 2 is not.

   - You can extract file data from any volume by itself.  This is
     meant only for disaster recovery (I dropped the master volume
     down the stairwell) and doesn't imply any requirements on
     directory tree information (much as fsck restores unattached
     inodes to /lost+found).

   - Volumes can refer to data (say, extents) on other volumes (both
     earlier and later volumes).  Preallocated space on any volume in
     a volume set can be returned for future reuse.

   - The address space of logical blocks for the volume set will be 48
     bits; 16 bits for the volume number and 32 bits for the logical
     block number within a volume.  Media can be big (200GB helical
     scan media exist now) so 32 bits may seem barely big enough, but
     in such cases you can use a big logical block size.  For example,
     a logical block size of 16KB implies a limit of 64 terabytes per
     volume; this should be ample for a few years.
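
The arithmetic in the last item can be checked with a short Python
sketch.  Placing the volume number in the high-order bits is an
assumption made for illustration, as are the function names.

```python
VOLUME_BITS = 16   # volume number within the volume set
BLOCK_BITS = 32    # logical block number within a volume


def pack_address(volume: int, block: int) -> int:
    """Combine volume and block numbers into one 48-bit address."""
    if not 0 <= volume < (1 << VOLUME_BITS):
        raise ValueError("volume number does not fit in 16 bits")
    if not 0 <= block < (1 << BLOCK_BITS):
        raise ValueError("block number does not fit in 32 bits")
    return (volume << BLOCK_BITS) | block


def unpack_address(address: int):
    """Split a 48-bit address back into (volume, block)."""
    return address >> BLOCK_BITS, address & ((1 << BLOCK_BITS) - 1)


def volume_limit(block_size: int) -> int:
    """Largest volume, in bytes, addressable with the given block size."""
    return (1 << BLOCK_BITS) * block_size
```

With 16KB blocks, volume_limit(16 * 1024) is 2^46 bytes, the 64
terabytes quoted above.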

Defect Management

We spent a lot of time on this and learned a lot, but basically put it
off to the next meeting.  What we mean by ``defect management'' is
``How do we deal with write errors from the file system's point of
view?'' (We ignore the disk controller and the device driver, both of
which do some unknown amount of more-or-less transparent error
management.)

We discussed the ``sane'' approach: insert a layer between the file
system and the device driver that handles errors, allowing the
file-system code to assume an error-free interface.  This apparently
good idea is ruled out by
slip-sectoring, a (to my mind bogus) technique, which says, ``if
writing block n fails, then try subsequent blocks (n+1, n+2, ...)
until we succeed.'' Slip-sectoring is mainly used to enhance
performance (it does ensure that blocks are more-or-less contiguous),
and some disk controllers use it as their error-management technique.
(This really screws up your logical address space; it is legitimate
for a SCSI disk, your typical error-free, logical-address-space disk
interface, to write logical block 5 at physical block 5, then logical
block 1 at physical block 4 (1-3 were write errors), then disallow I/O
to logical blocks 2,3, and 4 because there is no place to put them -
these blocks just vanish!)

As preparation for the next meeting, Don Crouse, who deals mainly with
high-end machines like Crays and large IBMs, is writing a position
paper on performance, and members of the committee, many of whom are
drive manufacturers or integrators, are collecting estimates of error
rates we have to deal with.  (This matters; I see one bad block out of
100,000, but some people have used drives with a bad block in every
100.) The problem is that WORMs have really slow seek times, and when
you are pouring a 50MB/s Cray channel at a set of WORMs, you can't
afford to spend 1-2 seconds seeking to the bad block area.  I
personally think we should just do regular bad-block mapping (like
most SMD disk drivers) out of a special system file, and people with
performance concerns should arrange to have this space spread over the
disk.

Endian-ness

A poll was taken of who really cared which way integer fields were
stored; the results were LSB - 1, MSB - 1, Don't Care - 11.  It is
awkward to specify one of LSB and MSB; this puts half the systems out
there at a competitive (performance) disadvantage (though I am
skeptical of whether it's significant).  Even though we're specifying
an interchange standard, the group felt that most interchange would be
between systems of the same endian-ness, so we should, somehow, allow
native byte order.  Accordingly, we agreed that endian-ness will be
specified in the volume header (for the whole volume set).  In
retrospect, I think this was silly; we should have just picked one
way.  In order that everyone important be evenly disadvantaged, we
could have used some byte order like 3-0-1-2 that no one uses.

Finale

The committee is trying to nail down a firm proposal for balloting.
We anticipate a substantial amount of change at the next meeting (Oct
16-18 in Nashua, NH) and have reserved time (Dec 11-13, but no place)
for an additional meeting so that we can ballot after the following
meeting (Jan 29-31, Bay area).  We now have a working paper (available
by the end of September or so); I think it likely we can meet this
schedule, but who knows.

Anyone interested in attending any of the above meetings should
contact either the chairman, Ed Beshore (edb@hpgrla.hp.com), or me
(andrew@research.att.com, research!andrew, (908)582-6262).  I am also
soliciting your comments on necessary inode fields and defect
management.  I will present anything you give me at the next meeting.

Volume-Number: Volume 21, Number 116

jsh@usenix.org (Jeffrey S. Haemer) (03/27/91)

Submitted-by: pc@hillside.co.uk (Peter Collinson)

An Update on UNIX-Related Standards

ANSI X3B11.1: WORM File Systems

USENIX Standards Watchdog Committee
Jeffrey S. Haemer <jsh@usenix.org>, Report Editor


March 26, 1991


Andrew Hume <andrew@research.att.com> reports on the January 22-24, 1991
meeting in Murray Hill, NJ:

Introduction

X3B11.1 is working on a standard for file interchange on write-once
media (both sequential and non-sequential, i.e., random access): a
portable file system for WORMs.  First let me apologize for laggardly
snitching; we have had an extra meeting (in December) to accelerate our
progress with the draft proposal and I have been busy writing a
programmer's guide to the draft proposal.  I shall describe the results
of the last three meetings, October (Nashua, NH), December (Murray
Hill, NJ), and January (San Jose, CA), not in chronological order, but
rather as a summary of where we are now.  Although many details remain
to be ironed out, we have broad agreement on the current proposal.

Multi-volume file systems

The draft proposal supports multi-volume file systems.  To avoid the
confusion that reigned at our meetings, I will define what this means.
A volume is a logical address space (on some medium).  Thus, a typical
WORM disk is two volumes, as each side is addressed separately.  A
volume partition is simply a contiguous subset of a volume's address
space.  A logical volume is simply a set of (volume) partitions upon
which a file system is recorded.  Finally, a logical volume set is a
set of volumes with a single volume set identifier.  (That is, it is
simply a publishing concept.) Note, however, that when I say file
system, I mean a set of files and directories described by possibly
multiple directory hierarchies (typically each would be in a different
character set).  The (logical) block size, not the physical sector
size, is $2^i$ bytes, $9 \le i < 65536$, and implementations would
have to support at least a block size of 64KB.  The various size
limits are generous; internal block addresses allow 64K volumes, 64K
partitions per volume, and $2^{32}$ blocks per partition.

Volume Headers

The location of the volume header (the analog of the superblock) is a
tricky issue because of the requirement that systems be able to boot
off a disk in our format and there is simply no consensus on the size
or location of the boot area.  Accordingly, pointers to the volume
header (actually a sequence of various descriptor records) are
recorded at one or more of 0, 16, 64, 128, 192, 256, $N - 16$, $N - 4$
(where $N$ is the size of the disk).  The seek speed (or rather the
lack of seek speed) of WORM disks encouraged us to put these at both
ends of the disk.  The volume header record, like all the other major
control structures, has a 16-bit CRC and a unique 8-byte tag, which
should prevent misrecognition.
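
To make the lookup concrete, here is a hedged Python sketch of
scanning the candidate locations for a valid pointer record.  The tag
value, the record layout (tag, 32-bit block number, CRC), the CRC
variant computed by binascii.crc_hqx, and the assumption that
locations are counted in blocks are all invented for illustration;
the draft defines the real formats.

```python
import binascii
import struct

TAG = b"ANCHOR\0\0"              # made-up 8-byte tag, not the draft's
RECORD = struct.Struct("<8sIH")  # tag, header block number, CRC-16


def candidate_blocks(n: int) -> list:
    """Pointer locations named in the report, for a disk of n blocks."""
    spots = [0, 16, 64, 128, 192, 256, n - 16, n - 4]
    return [b for b in spots if 0 <= b < n]


def make_anchor(header_block: int) -> bytes:
    """Build a pointer record in the hypothetical layout above."""
    body = struct.pack("<8sI", TAG, header_block)
    return body + struct.pack("<H", binascii.crc_hqx(body, 0))


def find_header(read_block, n: int):
    """Return the header block from the first valid pointer, or None."""
    for block in candidate_blocks(n):
        raw = read_block(block)[:RECORD.size]
        if len(raw) < RECORD.size:
            continue
        tag, header_block, crc = RECORD.unpack(raw)
        if tag == TAG and binascii.crc_hqx(raw[:-2], 0) == crc:
            return header_block
    return None
```

Here read_block is any callable that returns the bytes of a block; a
record is accepted only if both the tag and the CRC match, which is
the misrecognition guard described above.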

Volume/Partition Structure

The volume layer handles space allocation for the volume, definitions
of partitions, and bad-block mapping.  The partition layer does its own
space allocation, supports the file system, and does partition-access
logging.  Partitions have file-system-type tags; the intent is to allow
partition $w$ to be an X3B11.1 file system, partition $x$ to be a CDROM
file system, partition $y$ to be an MS-DOS floppy file system and
partition $z$ to be of unknown type.  There should be a registry for
this type field; vendors may want to register their file-system
formats.

Bad-Block Handling

A simple defect-management scheme has been adopted; it is similar to
the bad-block remapping scheme used for most SMD disks.  There was
considerable resistance to such a scheme, particularly from the
representatives of the hardware vendors, as the (SCSI) WORM disks
already do as much error detection/correction as is possible.  However,
defect management (above the disk driver level) is still necessary
because

  1.  error correction/detection in the drive can be, and for
      performance reasons often is, turned off,

  2.  errors can easily occur between the disk and the host's main
      memory (have you ever heard of DMA or bus errors?), and

  3.  even though SCSI disks present an ``error free'' interface, most
      drives have a limited number of errors they can cope with, and
      many early drives did little or no error correction.
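
A bad-block remapping layer of this general kind can be sketched in a
few lines of Python.  The class, the in-memory dictionary standing in
for the medium, and the fails hook that simulates write errors are
hypothetical; the draft's actual table format is not described in
this report.

```python
class RemappedDisk:
    """Wrap a block store, redirecting known-bad blocks to spares."""

    def __init__(self, blocks: dict, spares: list):
        self.blocks = blocks        # stands in for the physical medium
        self.spares = list(spares)  # unused replacement blocks
        self.remap = {}             # bad block -> replacement block

    def _resolve(self, block: int) -> int:
        return self.remap.get(block, block)

    def mark_bad(self, block: int) -> int:
        """Assign the next spare to a block that failed to write."""
        if not self.spares:
            raise IOError("no spare blocks left")
        self.remap[block] = self.spares.pop(0)
        return self.remap[block]

    def write(self, block: int, data: bytes, fails=lambda b: False) -> None:
        """Write a block, remapping and retrying while the write fails."""
        target = self._resolve(block)
        while fails(target):
            target = self.mark_bad(block)
        self.blocks[target] = data

    def read(self, block: int) -> bytes:
        return self.blocks[self._resolve(block)]
```

The file system above this layer keeps using the original block
number; only the remap table knows where the data actually landed.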

FCB Format

As you may recall, multiple versions of the direct entry (the
equivalent of the inode) are stored in a data structure called the
file control block (FCB).  The original proposal involved various
levels of indirect blocks exactly like classic Unix file systems.  We
adopted my proposal (adapted from an observation by Dennis Ritchie)
for a simpler, more general format that allows arbitrary structures,
which can be specialized for different applications.

Partition Access Records

This is more like logging changes to the file system than a security
thing like access control lists.  The idea is to have periods of
writing to the partition bracketed by specific control records so that
it will be possible to tell if a system closed out that partition
gracefully.  (More bluntly, did we unmount the partition gracefully or
did the system crash in the middle of a session?) These records are
kept on a per-file-system basis and are recorded as variants of
direct entries in a structure identical to FCBs.  Another side issue
is support for a so-called ``stable'' record, which is analogous to
the proposed stable sync feature of BSD Unix.  (The control structures
such as inodes and indirect blocks are written to disk but the user's
data may not be, yet.) This peculiar state avoids the need to run fsck
(or its equivalent) on the disk but you still have to get the user's
data from somewhere.  [Ed: does anyone really need this ``stable''
state?]
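
The bracketing idea can be sketched as follows in Python.  The record
names and the three-way classification are assumptions for
illustration; the report does not give the actual record types.

```python
from enum import Enum


class Rec(Enum):
    OPEN = "open"      # a system began writing to the partition
    STABLE = "stable"  # control structures flushed, user data maybe not
    CLOSE = "close"    # the partition was closed out gracefully


def partition_state(log: list) -> str:
    """Classify a partition from the last access record written."""
    if not log or log[-1] is Rec.CLOSE:
        return "clean"
    if log[-1] is Rec.STABLE:
        return "structures consistent, user data may be missing"
    return "crashed mid-session"
```

A log that ends in an unmatched OPEN record means the system went
down in the middle of a session, so the equivalent of fsck has to
run.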

Recording Directories

For performance reasons, it is proposed that directories, or rather the
records (FIDs) identifying the files (and subdirectories) in that
directory, be kept in optionally sorted order.  This would be in binary
and not lexicographic order (thus evading nettlesome character-set-
collating-order issues).  It is not trivial to support this but is
probably worth it.  Related to this is the issue of system areas in
directories and FIDs.  It is expected that these areas will contain
accelerator structures, such as B-tree indices and so on.  Here, and
elsewhere in the standard, the governing principle is to allow systems
to use such structures but to neither mandate nor standardize their
use.
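
A small Python sketch shows why binary order is attractive: sorting
and searching need only byte comparisons.  Encoding names as Latin-1
and representing a directory as a list of byte strings are
assumptions for illustration, not the draft's FID layout.

```python
import bisect


def build_directory(names: list) -> list:
    """Encode names and sort them by raw byte value, not collating order."""
    return sorted(name.encode("latin-1") for name in names)


def lookup(directory: list, name: str) -> int:
    """Binary-search a sorted directory; return the index or -1."""
    key = name.encode("latin-1")
    i = bisect.bisect_left(directory, key)
    return i if i < len(directory) and directory[i] == key else -1
```

Note that binary order puts ``Alpha'' before ``beta'' and accented
names after ``zeta'', which no human collating order would do; that
is the price of avoiding the collating-order question.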

Anonymous Files

There are numerous FCBs, or file-like objects, that have no FID.  An
example might be a Macintosh resource fork.  The question is whether to
make these visible to the user.  This is a serious issue, and one not
confined to this standard.  It is an issue for the system supporting
access to the file system on the disk.  Do we rely on this system to do
the right thing or should we mandate a mechanism?  For example,
consider a Macintosh file (with its resource fork) on a system
(say Unix) that doesn't have that concept.  We can either trust that
the vendor supplying your Unix has implemented an fcntl (or ioctl) to
access the resource fork, or we can evade the issue completely by
mandating that the resource fork be available for normal access by a
reserved name such as foo.RFORK.  The general feeling is that users
will not allow a standard to reserve parts of the file name space for
its own use.  Thus, it seems likely that access would have to be via
standardized fcntl calls, but these are outside the scope of our
standard.

Byte Order

I have pressed the issue of the byte order for numeric fields.  The
previous notion was to allow the recording system to choose the byte
order.  The issue is not technical (everyone seems happy to pick just
one and stick with it) but political.  We picked LSB order: the order
used by the low-end (and slowest) systems.  We measured the performance
degradation for low-end MSB systems (the slowest Macintosh we could
find), and the CPU cost of straightforward C code.  Interpreting the
byte order for the worst case (a block of integer block numbers) was
about 10ms - comparable to doing a single disk I/O and one or two
orders of magnitude less than the cost of doing a disk seek.  (Careful
assembly code would be much faster than this.)
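
For readers who want to see what ``straightforward'' decoding looks
like, here is a Python analogue of the byte-at-a-time approach (the
measurement above was of C code, and this sketch makes no claim about
its speed).  The 32-bit field width matches the block-number size
mentioned earlier; the function name is made up.

```python
def decode_block_numbers(raw: bytes) -> list:
    """Decode 32-bit LSB-first integers one byte at a time."""
    if len(raw) % 4:
        raise ValueError("block is not a whole number of 32-bit fields")
    out = []
    for i in range(0, len(raw), 4):
        out.append(raw[i]
                   | raw[i + 1] << 8
                   | raw[i + 2] << 16
                   | raw[i + 3] << 24)
    return out
```

Because the bytes are assembled with shifts rather than read as a
native integer, the same code gives the same answer on LSB and MSB
hosts.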

Extended Attributes

The direct entry for a file has many attributes or fields.  Some of
these will be stored directly in the direct entry, where they are
faster to access.  The rest will be stored in an extended attribute
record area, much like resources in a Macintosh resource fork.  There
are two
issues:  which attributes get faster access and how do you access the
other attributes?  The former is something the standard specifies; our
guiding principle was to include the fields needed for a Unix stat or
an MS-DOS (or VMS) dir command.  Unfortunately, the issue of access is
beyond the domain of our standard and needs to be addressed by POSIX,
probably best by 1003.8.  Internally within our standard, the extended
attributes are identified by a 32-bit number, some of which are set in
the standard and the rest by a registry maintained by some authority
(like ANSI).  The current list of extended attributes is given below;
treat it as very preliminary and subject to change.

     information creation         file abstract
     information modification     file type
     information expiration       associated file
     information effective        data compression
     file creation                protection
     file access                  application-specific data segment
     file modification            implementation segment
     file backup                  escape sequences segment
     file expiration              action history
     file attribute               icon
     file effective               environment type

Character Sets

We have adopted a somewhat simpler way of dealing with character sets
than the CD-ROM standard (ISO 9660).  The current schemes available are
 ----------------------------------------------------------------------
|   0|   0-9A-Z . from Latin-1 (ISO 8859-1),                          |
|   1|   portable filename character set 0-9A-Za-z .- (POSIX 1003.1), |
|   2|   $G_0$ set from Latin-1,                                      |
|   3|   all graphic characters from Latin-1, and                     |
| 255|   defined via escape sequences - the full scale mechanisms     |
|    |    of ISO 2022, which are only rarely implemented.             |
 ----------------------------------------------------------------------

International Activity

The appropriate ISO committee (SC15) has been reconstituted with Japan
supplying secretariat duties.  A meeting is expected in July or
September and it is hoped that there will be close cooperation between
X3B11.1 and SC15.  There is some concern that ANSI might awaken the
long-dormant file structure committee and that this might delay
acceptance of X3B11.1's work.  Also, because of a request by a working
group involved in the Philips CD-WO device (a combination medium that
is a 5.25in WORM with a CD-ROM portion), ECMA might also reconstitute
its file structure committee (TC15).

Finale

What can, or should, you do?  As always, I welcome any feedback,
specific or general, on the work our committee does.  (I must express my
appreciation to USENIX for publishing these reports; nearly all the
mail I have received about X3B11.1's work starts off like, ``I read
your report in the so-and-so ;login:''.) In particular, I invite
comments on any fields or attributes you would like standardized and -
perhaps more important to the Unix community - how to access auxiliary
information about a file in a standard way.  Plenty of ad hoc
solutions already exist for the cases of versioned files (VMS file
systems on Ultrix systems), Macintosh files mounted as NFS file
systems, and CD-ROM file systems.  The number of these problems will
certainly increase over time; we need to address the solutions now
before we standardize on file system interfaces (such as 1003.8) that
omit such mechanisms.

If you would like more details on X3B11.1's work, you should contact
either me (andrew@research.att.com, (908) 582-6262) or the committee
chair, Ed Beshore (edb@hpgrla.hp.com).  I think the two most useful
documents are the current draft of the working paper (about 80 pages)
and a programmer's guide to the draft (about 12 pages written by me).
I will send you copies of the latter document; requests for other
documents or more general inquiries about X3B11.1's work would be best
sent to Ed Beshore.

The next meeting is in North Falmouth, MA on April 23-26, 1991.  Anyone
interested in attending should contact either me or Ed Beshore.



Volume-Number: Volume 23, Number 22