[comp.unix.wizards] 4BSD file system structure

hinton@netcom.UUCP (Greg Hinton) (10/27/89)

I'm hoping some of you Wizards can help me decipher the structure
of the 4BSD file system.

First, what is the layout of the zero'th cylinder group?  According
to <sys/fs.h>, BBLOCK & SBLOCK are "absolute disk addresses".  Is
that CSRG's way of saying "sector numbers"?  Does the zero'th
cylinder group contain the usual complement of cg block, inode blocks
and data blocks?  If so, doesn't the presence of the boot block
throw things off?

Second, how does the on-disk inode (struct dinode in <sys/inode.h>)
indicate that the last block in a file is a fragment?

Thanks in advance for your help.

-- 
Greg Hinton
DOMAIN: hinton@netcom.uucp
UUCP: uunet!apple!netcom!hinton

chris@mimsy.umd.edu (Chris Torek) (11/09/89)

Since no one has answered this one yet (gee, why not? :-) ):

In article <3410@netcom.UUCP> hinton@netcom.UUCP (Greg Hinton) writes:
>First, what is the layout of the zero'th cylinder group?

Pretty much the same as any other.  There is no special case for it:

>According to <sys/fs.h>, BBLOCK & SBLOCK are "absolute disk addresses".
>Is that CSRG's way of saying "sector numbers"?

More or less.  These have been changed to BBOFF and SBOFF, in bytes,
so as to handle other sector sizes, but they still amount to 0 and 8K
bytes respectively.
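
A minimal sketch of why byte offsets are more portable than sector
numbers.  The two offsets are the values given above; byte_to_sector()
is a hypothetical helper, not something from <sys/fs.h>.

```c
#include <assert.h>

#define BBOFF 0L      /* boot block offset in bytes, as stated above */
#define SBOFF 8192L   /* super-block offset in bytes (8K), as stated above */

/*
 * Hypothetical helper: turn a byte offset into a sector number for a
 * disk with the given sector size.
 */
long
byte_to_sector(long off, long secsize)
{
	/* The offset must fall on a sector boundary. */
	assert(secsize > 0 && off % secsize == 0);
	return off / secsize;
}
```

With 512-byte sectors byte_to_sector(SBOFF, 512) is 16; with 1024-byte
sectors the same byte offset lands on sector 8, so a fixed sector
number would be wrong on one of the two disks.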

>Does the zero'th cylinder group contain the usual complement of cg
>block, inode blocks and data blocks?

Yes.

>If so, doesn't the presence of the boot block throw things off?

No.  Each physical group of, e.g., 16 cylinders can be modeled thus:

	+----------------+
	|   data part 	 |
	| continued from |
	|  previous cg	 |
	|----------------|
	| cylinder group |
	|     stuff	 |
	| -- -- -- -- -- |
	|   data part	 |
	|  for this cg	 |
	+----------------+

Starting at the `cylinder group stuff', one moves downward some
distance to find the `data part'; this data part continues in the next
physical group (here the next 16 cylinders).  The last group usually
has less data than the rest, but the `missing' part that would normally
be in the next clump of 16 cylinders is straightforwardly represented
as `not free'.

The size of the `data from previous cg' is determined by the cyl group
number, and starts at 16K, leaving room for the boot block and the
master super-block.  It goes up by some amount for each physical group
of cylinders, so that the `cylinder group stuff' (which is, in order,
the super-block copy, the cylinder group data, the inodes, and then the
data) begins `further down' in each physical cg.  This puts its copy of
the super-block on a separate head, track, and/or cylinder from the
previous one.  With any luck, no matter what happens to the disk, at
least one super-block will remain intact, giving enough clues to
analyse the rest of the disk for files that also remain intact.
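
A rough model of the arithmetic, assuming the address macros in
<sys/fs.h> work the way I remember them (cgbase, cgstart and friends).
The struct is a cut-down stand-in for the real super-block, and the
numbers in example_fs are made up for an 8K/1K file system; all
addresses are in fragments.

```c
#include <assert.h>

/* Hypothetical cut-down super-block; every address is in fragments. */
struct fs {
	long fs_fpg;      /* fragments per cylinder group */
	long fs_cgoffset; /* skew added for each successive group */
	long fs_cgmask;   /* makes the skew wrap around periodically */
	long fs_sblkno;   /* super-block copy, relative to cgstart */
	long fs_cblkno;   /* cylinder group block, relative to cgstart */
	long fs_iblkno;   /* first inode block, relative to cgstart */
	long fs_dblkno;   /* first data block, relative to cgstart */
};

/* Made-up example values; the skew repeats every 16 groups. */
const struct fs example_fs = { 8192, 32, ~0xfL, 16, 24, 32, 160 };

/* Where the physical group of cylinders for cg number c begins. */
long
cgbase(const struct fs *fs, long c)
{
	return fs->fs_fpg * c;
}

/* Where the `cylinder group stuff' is measured from: base plus skew. */
long
cgstart(const struct fs *fs, long c)
{
	assert(c >= 0);
	return cgbase(fs, c) + fs->fs_cgoffset * (c & ~fs->fs_cgmask);
}

/* Super-block copy, cg block, first inode block, first data block. */
long cgsblock(const struct fs *fs, long c) { return cgstart(fs, c) + fs->fs_sblkno; }
long cgtod(const struct fs *fs, long c)    { return cgstart(fs, c) + fs->fs_cblkno; }
long cgimin(const struct fs *fs, long c)   { return cgstart(fs, c) + fs->fs_iblkno; }
long cgdmin(const struct fs *fs, long c)   { return cgstart(fs, c) + fs->fs_dblkno; }
```

With these values the super-block copy of group 0 sits 16 fragments
(16K) in, and the copy in group 1 sits 32 fragments further into its
group than that, which is the `further down' effect described above.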

>Second, how does the on-disk inode (struct dinode in <sys/inode.h>)
>indicate that the last block in a file is a fragment?

It does not; it does not have to.  The `blksize' macro determines
the size of a file block, based entirely on the block's number and
the file's size:

#define blksize(fs, ip, lbn) \
	(((lbn) >= NDADDR || (ip)->i_size >= ((lbn) + 1) << (fs)->fs_bshift) \
	    ? (fs)->fs_bsize \
	    : (fragroundup(fs, blkoff(fs, (ip)->i_size))))

In other words, if the block number (we are talking 8K byte blocks if
this is an 8K/1K file system, e.g.) is >= NDADDR (12), or if the block
is not the very last block of the file, it is a full block.  Otherwise
it is the last block and *is* a fragment; its size is the smallest
number of fragments that completely contains the last part of the file
(after all the previous full blocks are counted).  The last `fragment'
can have just as many fragments as a full block, in which case its size
is the same as fs->fs_bsize, but it could be smaller.

(Thus, files >= 48K (4K/any) or 96K (8K/any) never end in a fragment.)
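
A small worked example of the blksize macro.  The structs here are
hypothetical stand-ins holding only the fields the macro touches, and
blkoff and fragroundup are simplified equivalents written for this
sketch (they assume power-of-two block and fragment sizes), not the
definitions from the kernel headers.

```c
#include <assert.h>

#define NDADDR 12   /* direct block addresses in an inode */

/* Hypothetical cut-down struct fs and struct inode. */
struct fs {
	long fs_bsize;   /* block size in bytes, e.g. 8192 */
	long fs_fsize;   /* fragment size in bytes, e.g. 1024 */
	int  fs_bshift;  /* log2(fs_bsize) */
};

struct inode {
	long i_size;     /* file size in bytes */
};

/* Offset of a byte position within its block (simplified). */
#define blkoff(fs, loc)     ((loc) & ((fs)->fs_bsize - 1))
/* Round a byte count up to a whole number of fragments (simplified). */
#define fragroundup(fs, sz) \
	(((sz) + (fs)->fs_fsize - 1) & ~((fs)->fs_fsize - 1))

#define blksize(fs, ip, lbn) \
	(((lbn) >= NDADDR || (ip)->i_size >= ((lbn) + 1) << (fs)->fs_bshift) \
	    ? (fs)->fs_bsize \
	    : (fragroundup(fs, blkoff(fs, (ip)->i_size))))

/* Hypothetical wrapper so the macro can be tried with plain numbers. */
long
block_bytes(long bsize, long fsize, int bshift, long size, long lbn)
{
	struct fs fs = { bsize, fsize, bshift };
	struct inode ino = { size };

	assert(lbn >= 0);
	return blksize(&fs, &ino, lbn);
}
```

On an 8K/1K file system a 10000-byte file has a full first block
(block_bytes(8192, 1024, 13, 10000, 0) is 8192) and a two-fragment
tail (block 1 gives 2048, since the remaining 1808 bytes round up to
two 1K fragments).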
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

rcodi@chudich.co.rmit.oz (Ian Donaldson) (11/14/89)

chris@mimsy.umd.edu (Chris Torek) writes:
>In article <3410@netcom.UUCP> hinton@netcom.UUCP (Greg Hinton) writes:
>>If so, doesn't the presence of the boot block throw things off?

I've always wondered why the boot block actually needs to be in
the filesystem at all?

I've set up a small system where the 'b' partition contains vmunix,
and /vmunix is really a block special file with permissions 644 that
points to that partition.

This way, all the PROM monitor need do is read the disk label from partition
'a', then read vmunix from partition 'b'.  vmunix then boots, and the
root lives on partition 'c' and the swap on 'd'.

ie: this sort of partition layout

	a:  label, bad block map if needed, substitute good blocks
	b:  vmunix  (the default bootable kernel)
	c:  root
	d:  swap
	e:  another vmunix if necessary
	f:  another vmunix if necessary
	g:  tmp
	h:  usr
	i:  other
	j:  other
	   etc

This has the advantages that:
	1) The primary boot program on disk is eliminated, and
	   also the secondary boot (usually /boot if it exists) is eliminated.

	   Also, the silly 8k byte size limit (the amount of space the
	   filesystem allows for the boot block) goes away; that limit is
	   what usually forces a secondary boot program, /boot.
	   
	   On SunOS 3.x, the primary boot program read /boot by interpreting 
	   the filesystem.  Fine.
	   
	   On SunOS 4.x, /boot doesn't know how to read the filesystem 
	   (probably because it would have overflowed the 8k size limitation) 
	   and has to have a hard-coded list of block numbers in it, set there 
	   by installboot(8).  This sucks, to say the least.  When you move
	   or copy to /boot you need to rerun installboot(8).  (Not that
	   you do this often, however...)

	2) the prom monitor code only needs to know to read partition b
	   after reading the label to determine where it starts

	3) the label is always at absolute sector 0 and contains the
	   partition info

	4) vmunix boots blindingly fast as the monitor reads it
	   (eg: on a 68000/st506 combination it takes 2-3 seconds to
	   get it running!  This compares with about 10-15 seconds it
	   takes to read the various boot programs and vmunix on a Sun-3)

	5) makes the system more portable.  You don't need some
	   non-portable assembler code for the primary boot, and you don't need
	   some extra filesystem dependent code for the secondary boot,
	   enabling booting to be independent of the root filesystem type.
	   (4.4 bsd is even rumoured to have support for multiple filesystem
	   types)

This has the disadvantages that:

	1) /vmunix is not a regular file.  Not a real problem, but
	   purists may object.   Programs that need to know the
	   size of /vmunix won't be able to use stat(2) to find
	   this out.  However, nlist(3) doesn't need this information, so
	   programs like ps(1), pstat(1), w(1), size(1) won't need to 
	   be changed.

	2) the size of the partition to hold vmunix must be estimated
	   prior to installing.  About a megabyte is sufficient
	   in most cases for most kernels.  It is tunable in the
	   partition table anyway.

	3) you may need more than 8 partitions per disk to set your system up
	   (this isn't hard to change in the drivers)

	4) juggling the kernel you boot is not as simple as mv'ing the
	   name.  You need to copy into the device.  This is not
	   too difficult really.  You could even have a parameter
	   in the label that specifies an alternate default boot
	   partition, but programs such as ps(1) would need to read
	   /vmunix still, necessitating some sort of "special" indirect 
	   device that points to the boot partition if you want to do
	   it properly.
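
A sketch of what the monitor has to do under this scheme: read the
label, look up the boot partition, and start reading from its first
sector.  The label layout, field names and sector counts below are all
invented for illustration; they are not the format of any real disk
label.

```c
#include <assert.h>

#define NPART 16   /* hypothetical: room for more than 8 partitions */

struct partition {
	long p_start;   /* first sector of the partition */
	long p_size;    /* length in sectors; 0 means unused */
};

struct label {
	struct partition l_part[NPART];  /* 'a' is index 0, 'b' is 1, ... */
	int l_bootpart;                  /* index of default boot partition */
};

/* Made-up label: a = label area, b = vmunix, c = root, d = swap. */
const struct label example_label = {
	{ { 0, 64 }, { 64, 2048 }, { 2112, 32768 }, { 34880, 65536 } },
	1   /* boot from 'b' */
};

/*
 * Return the first sector of the partition to boot from, or -1 if the
 * label names a partition that does not exist.
 */
long
boot_start(const struct label *lp)
{
	int i;

	assert(lp != 0);
	i = lp->l_bootpart;
	if (i < 0 || i >= NPART || lp->l_part[i].p_size == 0)
		return -1;
	return lp->l_part[i].p_start;
}
```

Pointing l_bootpart at 'e' or 'f' is the "alternate default boot
partition" idea from item 4; nothing else in the monitor has to change.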

My guess is that the 4.[23]BSD way of putting the boot block into the first
part of the filesystem stems from the fact that:

	1) at the time that the FFS was introduced, there were no
	   4BSD disk labels (except from some vendors such as Sun),
	   and the VAX prom may have booted VMS this way (I'm not
	   sure about this).  Partitions were hardcoded in the
	   drivers and thus more difficult to change.
	   The 4.3-tahoe release added the labels.

	2) there might be a limitation in the bootprom for the VAX
	   that prevents such a layout (I don't know, but it sounds
	   like something easily solvable with a prom change)

Care to comment Chris?

Ian D

m5@lynx.uucp (Mike McNally) (11/14/89)

rcodi@chudich.co.rmit.oz (Ian Donaldson) writes:

>I've always wondered why the boot block actually needs to be in
>the filesystem at all?

>I've set up a small system where the 'b' partition contains vmunix,
>and /vmunix is really a block special file with permissions 644 that
>points to that partition.
> . . .
>This has the disadvantages that:
> . . .

Another couple of disadvantages:

    Bad block in the vmunix partition.  I see no way the firmware could
    cope.  (Of course, if the disk is a decent SCSI drive, it will do
    its own remapping and this problem almost goes away.)

    I make a new kernel.  I install it in its special partition.  It
    fails because I'm slightly ignorant and screwed up.  I lose big.





-- 
Mike McNally                                    Lynx Real-Time Systems
uucp: {voder,athsys}!lynx!m5                    phone: 408 370 2233

            Where equal mind and contest equal, go.

rcodi@chudich.co.rmit.oz (Ian Donaldson) (11/16/89)

m5@lynx.uucp (Mike McNally) writes:

>rcodi@chudich.co.rmit.oz (Ian Donaldson) writes:

>Another couple of disadvantages:

>    Bad block in the vmunix partition.  I see no way the firmware could
>    cope.  (Of course, if the disk is a decent SCSI drive, it will do
>    its own remapping and this problem almost goes away.)

The firmware could process the bad block map in the 'a' partition
quite easily... in our systems it's just next to the partition table.

Of course the only blocks that can't be fixed are the blocks containing
the bad block map unless you put a special case in for that (eg: if
the bad block map sits in a bad block then try a fixed list of other
blocks)
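
Roughly what the firmware would have to do for each sector it reads.
The map format here is invented for the sketch (a flat table of
bad/substitute pairs); it is not the layout used on our systems or on
any particular controller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bad block map entry. */
struct badent {
	long bad;    /* sector that cannot be read */
	long good;   /* substitute sector holding its contents */
};

/* Made-up map with two remapped sectors. */
const struct badent example_map[] = { { 100, 40 }, { 2051, 41 } };

/*
 * Return the sector that should really be read when the caller asks
 * for `sector': the substitute if it is listed as bad, else itself.
 */
long
remap_sector(const struct badent *map, size_t n, long sector)
{
	size_t i;

	assert(map != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (map[i].bad == sector)
			return map[i].good;
	return sector;
}
```

A linear search is enough for a map that fits in a sector or two, and
it keeps the code small enough for a prom.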

>    I make a new kernel.  I install it in its special partition.  It
>    fails because I'm slightly ignorant and screwed up.  I lose big.

No problem.  Just boot one you have backed up on another boot partition.
If you don't have one, go to your distribution tape and run standalone
copy to copy one in to some boot partition.  It's even easier than having
to load an entire root partition into the swap device just so you can
boot it to find a kernel to copy back to your real root!  (like you
do with SunOS)

Ian D

pechter@ocpt.ccur.com (Bill Pechter <pechter>) (11/17/89)

In article <1602@minyos.xx.rmit.oz>, rcodi@chudich.co.rmit.oz (Ian Donaldson) writes:
> chris@mimsy.umd.edu (Chris Torek) writes:
> >In article <3410@netcom.UUCP> hinton@netcom.UUCP (Greg Hinton) writes:
> >>If so, doesn't the presence of the boot block throw things off?
> 
> My guess is that the 4.[23]BSD way of putting the boot block into the first
> part of the filesystem stems from the fact that:
> 
> 	1) at the time that the FFS was introduced, there was no
> 	   4BSD disk labels (except from some vendors such as Sun),
> 	   and the VAX prom may have booted VMS this way (I'm not
> 	   sure about this). 
> 	   The 4.3-tahoe release added the labels.
> 
> Care to comment Chris?
> 
> Ian D

I guess I ought to jump in here -- there was a boot prom on Vax 11/750's that
read in the boot block from cyl 0, sector 0, track 0 -- but that was the only 
Vax that did that.  The rest used the console floppy or TU58 DECtape to read
in a console program or loader that did the initialization.

It was a pretty limited (i.e. AWFUL) way to do things.  DEC had to go to 
the console tape mode on the 11/750 when they went to Vaxclusters...

However most 11/750 owners on Unix used the bootprom to boot...

There were up to 4 proms available -- labeled A through D.  There was
usually an RH780, RK07, TU78 and RA81 rom on the later Vax 750's.
The minimum was either RH or RK07 with TU58.

We did the rom boot when I was at Eaton on SysV Rel 2 and I know others using
4.2bsd did the same.  The TU58 took up to 3-5 minutes to load.  DEC clusters
running VMS also had to download microcode to the CI750.  Also, I seem to 
remember some SysV Unix sites had to put in a Translation Buffer patch 
sometimes from the TU58 on boot.

The 11/730 was worse -- it had 2 TU58's -- 1 with microcode, one with
console and boot.

Excuse faulty memory -- it's been about 2 years since my last 11/750 --
and about 3 since my DEC Field Service Days...

-- 
Bill Pechter -- Home - 103 Governors Road, Lakewood, NJ 08701 (201)370-0709
Work -- Concurrent Computer Corp., 2 Crescent Pl, MS 172, Oceanport,NJ 07757 
Phone -- (201)870-4780    Usenet  . . .  rutgers!pedsga!tsdiag!scr1!pechter
  **   MS-DOS is CP/M on steroids, bigger, bulkier and not much better  **

chris@mimsy.umd.edu (Chris Torek) (11/23/89)

In article <20@ocpt.ccur.com> pechter@ocpt.ccur.com
(Bill Pechter <pechter>) writes:
>... there was a boot prom on Vax 11/750's that read in the boot block
>from cyl 0, sector 0, track 0 -- but that was the only Vax that did that.

Yes, the 750 did this; but it is not the only one.  The 8200, for
instance, also reads block 0 and jumps to it; and microVAXen with
RD-series disks read block zero, but rather than executing the result,
they interpret it as a set of additional blocks to read, and other
mysterious things.

Other computer systems use similar bootstraps.  Still others (such as
the Power-6/32) do something even weirder: read a Unix file system,
open the file `/etc/fstab', and read the first line.  (Gak!) As a
result of the latter, the 4.3BSD-tahoe distribution for the tahoe
includes a `fake' first line in /etc/fstab, that names /dev/xfd0a.
The first few lines of /etc/fstab on one of the UCB machines:

	/dev/xfd0a:/: /			ufs	xx	1 1
	/dev/dk0a /			ufs	rw	1 1
	/dev/dk0b none			swap	sw	0 0
	/dev/dk0b /tmp			mfs	xx,-s=24000 0 0
	/dev/dk8b /tmp			ufs	rw	1 2
	/dev/dk3b /usr			ufs	rw	1 2

(/tmp exists twice; the `mfs' entry is fake, as yet.)

>It was a pretty limited (i.e. AWFUL) way to do things.  DEC had to go to 
>the console tape mode on the 11/750 when they went to Vaxclusters...

Actually, it is a fairly reasonable way to boot from a disk, since
it does not hard-code anything too terribly (unlike CCI's `assume you
have a 4.2BSD file system' scheme).  If *everything* is in ROM, it
becomes difficult to fix.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

roy@phri.nyu.edu (Roy Smith) (11/23/89)

	Remember when you hit "IPL" to read an 80-word bootstrap into low
core and jump to it?
-- 
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
{att,philabs,cmcl2,rutgers,hombre}!phri!roy -or- roy@alanine.phri.nyu.edu
"The connector is the network"

gwyn@smoke.BRL.MIL (Doug Gwyn) (11/24/89)

In article <1989Nov23.033711.1913@phri.nyu.edu> roy@phri.UUCP (Roy Smith) writes:
>	Remember when you hit "IPL" to read an 80-word bootstrap into low
>core and jump to it?

Remember the panic at IBM when they thought that it might not be
possible to get a self-sustaining secondary bootstrap going using
this one-card IPL on the 1401?  (Some clever fellow finally
managed to do it.)