[comp.unix.large] Epoch like filesystem

rodney@sun.ipl.rpi.edu (Rodney Peck II) (10/13/90)

Say, I was thinking the other day about how nice it would be to have an
Epoch file server here in the lab, but we already invested $150,000 in
a file server with 2.5 gig on it.  Instead of spending another $70,000
for the basic Epoch system, it would be really nice to use what we have.
I was thinking about how annoying it was that the salesman said we couldn't
use the 8mm tape drive we already have with the potential new Epoch
simply because we didn't buy it from them, and it hit me...

Why not write some code to make a standard sunos system behave like Epoch?

Basically, this means that the hard disk is paged off to optical disk as
it fills up.  Then, when the pages that are on the optical disk are
referenced, they are brought back on line.  The directory information
stays on the hard disk at all times, making things like "find" run at
a reasonable speed.

This could be done the right way (rewrite the kernel with proper paging
schemes and that sort of thing) or the easy way.

The easy way would be to do a find of the tree and make things that are
really old into symbolic links to the new file system on the optical
disk.  Could be a shell script?
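
For concreteness, here is a minimal Python sketch of the easy way.  It is
only an illustration under assumptions: the name migrate_old_files, the
180 day cutoff, and the use of last-access time to mean "really old" are
all made up here, and the optical file system is assumed to be mounted
as an ordinary directory outside the tree being scanned.

```python
import os
import shutil
import time


def migrate_old_files(live_root, archive_root, max_age_days=180, now=None):
    """Move files not accessed for max_age_days to archive_root, leaving symlinks.

    Hypothetical helper: the name and the cutoff are assumptions, not Epoch's.
    Returns the list of migrated paths, relative to live_root.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    # Absolute target so the symlink resolves from any directory.
    archive_root = os.path.abspath(archive_root)
    moved = []
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(live_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            # Skip symlinks (already migrated) and anything that is not a plain file.
            if os.path.islink(src) or not os.path.isfile(src):
                continue
            if os.stat(src).st_atime >= cutoff:
                continue  # accessed recently enough, leave it on the hard disk
            rel = os.path.relpath(src, live_root)
            dst = os.path.join(archive_root, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            # Renames on the same file system, otherwise copies and removes the source.
            shutil.move(src, dst)
            # Leave a link where the file used to be.
            os.symlink(dst, src)
            moved.append(rel)
    return moved
```

The directory tree itself stays on the hard disk, so "find" still only
touches the magnetic disk until a link is actually followed.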

So, this sort of thing would let you buy an inexpensive optical jukebox
(you can get the 10 platter 600 meg version for somewhere around $10k
I believe), hook it up, and have a pseudo epoch filesystem.

I'd like to hear what you think of this idea -- I'm planning to try it
out once our jukebox arrives (we're getting the 10 platter 1 gig/platter
version).  I think we could put aside two platters for an experiment in
this wacky idea.

anxiously awaiting the intelligent unix answers....


-- 
Rodney

chowe@bbn.com (Carl Howe) (10/16/90)

rodney@sun.ipl.rpi.edu (Rodney Peck II) writes:

>Why not write some code to make a standard sunos system behave like Epoch?

>Basically, this means that the hard disk is paged off to optical disk as
>it fills up.  Then, when the pages that are on the optical disk are
>referenced, they are brought back on line.  The directory information
>stays on the hard disk at all times, making things like "find" run at
>a reasonable speed.

>This could be done the right way (rewrite the kernel with proper paging
>schemes and that sort of thing) or the easy way.

>The easy way would be to do a find of the tree and make things that are
>really old into symbolic links to the new file system on the optical
>disk.  Could be a shell script?

Plan 9 from Bell Labs does a similar sort of thing.  They back up the
entire contents of their hard disk to optical every night as part of
the standard file system tree.  For example, all files created or changed
today would end up in the file system under /1990/1015/...  Yesterday's
files would be under /1990/1014/....  The hard disk on any given day
only contains the changes since the last optical backup.  You end up
with both fast access to recently created data and on-line access
to all backups.
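
Taking that description at face value, the dated layout can be sketched
in a few lines of Python.  This is an assumption-laden illustration, not
Plan 9 code: dump_dir and latest_copy are hypothetical names, and the
sketch assumes each dated directory holds only that day's created or
changed files, so finding a file means searching backwards day by day.

```python
import os
from datetime import date, timedelta


def dump_dir(root, day):
    """Directory holding the files dumped on `day`, e.g. <root>/1990/1015."""
    return os.path.join(root, f"{day.year:04d}", f"{day.month:02d}{day.day:02d}")


def latest_copy(root, relpath, today, max_days=365):
    """Search backwards from `today` for the newest dumped copy of relpath.

    Returns the path of the first copy found, or None if nothing turns up
    within max_days (an arbitrary limit chosen for this sketch).
    """
    for back in range(max_days + 1):
        day = today - timedelta(days=back)
        candidate = os.path.join(dump_dir(root, day), relpath)
        if os.path.exists(candidate):
            return candidate
    return None
```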

Of course, they have a separate file server machine to do all this magic.
You'd have to modify the file system to make all the copy on write stuff
work correctly.  However, perhaps you could simply make it a new
file system type and run it off the file system switch stuff.  That way
it'd be modular and you could still track OS updates.  Regardless, it
seems like a pretty neat idea.

Carl

renglish@hplabsz.HPL.HP.COM (Bob English) (10/16/90)

In article <QAX%GY^@rpi.edu> rodney@sun.ipl.rpi.edu (Rodney Peck II) writes:

>Why not write some code to make a standard sunos system behave like Epoch?
>...  The easy way would be to do a find of the tree and make things that are
>really old into symbolic links to the new file system on the optical
>disk.  Could be a shell script?

I see two potential problems with this system.  One is minor, the other
potentially troublesome.

Having kernel support for this type of thing gives you reasonably
complete integration into your file system and quick response to
changes.  By the first, I mean that the file is actually a file, stored
in the file system, backed up by the standard tools, and easily
deletable.  Assuming that you have people there who can fill up your
jukebox, then you're eventually going to have to expunge or archive the
stale data from the jukebox, as well.  When that happens, you're going
to have to do something to keep the name/data associations, and you may
want to delete the dangling links.  Probably not a killer, but something
to think about.

A bigger problem will be what to do when a file starts getting used
again after being pushed to the jukebox.  As soon as this starts to
happen, you're going to want the data back on the disk, and the symbolic
link approach isn't going to give you that.  If the access rate into the
jukebox isn't too high, you might be able to write a nightly script to
pull files back to the disk, but you'll never know when your jukebox is
about to thrash.

Another minor problem is that this system will appear transparent to the
user, but won't actually be.  Since it won't be able to free up disk
space on demand, the files on the jukebox may not be able to migrate
back without a lot of effort.  What do you do if a large file gets moved
to the jukebox while someone's on vacation and the disk fills up so that
he or she can't move it back?
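
To make the nightly pull-back and the disk-full case concrete, here is a
rough Python sketch.  Everything in it is an assumption for illustration:
the name pull_back_hot_files, the 7 day "hot" window, and the idea that
the access time of the file on the jukebox is updated when it is read
through the link (which depends on how that file system is mounted).

```python
import os
import shutil
import time


def pull_back_hot_files(live_root, archive_root, hot_days=7, reserve_bytes=0, now=None):
    """Replace symlinks to recently used archived files with the files themselves.

    Hypothetical helper.  Returns (restored, skipped); `skipped` lists links
    whose targets would not fit on the hard disk with reserve_bytes to spare.
    """
    now = time.time() if now is None else now
    cutoff = now - hot_days * 86400
    archive_root = os.path.realpath(archive_root)
    restored, skipped = [], []
    for dirpath, _dirnames, filenames in os.walk(live_root):
        for name in filenames:
            link = os.path.join(dirpath, name)
            if not os.path.islink(link):
                continue
            target = os.path.realpath(link)
            # Only touch links that point into the archive and are not dangling.
            if not target.startswith(archive_root + os.sep) or not os.path.isfile(target):
                continue
            st = os.stat(target)
            if st.st_atime < cutoff:
                continue  # not used recently, leave it on the jukebox
            # The script cannot free space on demand, so it can only check and skip.
            if st.st_size + reserve_bytes > shutil.disk_usage(dirpath).free:
                skipped.append(link)
                continue
            tmp = link + ".restore"
            shutil.copy2(target, tmp)
            os.replace(tmp, link)  # rename over the symlink itself, not its target
            os.remove(target)
            restored.append(link)
    return restored, skipped
```

The skipped list is exactly the vacation case: the sketch can report
that a file will not fit, but someone still has to make room for it.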

I don't mean to discourage you from doing this.  It sounds like fun, and
it sounds useful, but you may run into some headaches down the road.
Using the same shell script to ask people to archive voluntarily might
give you most of what you're looking for without causing the same
problems.

>anxiously awaiting the intelligent unix answers....

Me, too.

--bob--
renglish@hplabs

cudcv@warwick.ac.uk (Rob McMahon) (10/16/90)

In article <60058@bbn.BBN.COM> chowe@bbn.com (Carl Howe) writes:
>[In Plan 9] They back up the entire contents of their hard disk to optical
>every night as part of the standard file system tree.  For example, all files
>created or changed today would end up in the file system under /1990/1015/...
>Yesterday's files would be under /1990/1014/....  The hard disk on any given
>day only contains the changes since the last optical backup.  You end up with
>both fast access to recently created data and on-line access to all backups.

Surely you want the recently *accessed* data, not just the recently *created*
data on hard disk ?  Otherwise your system's going to be a bit slow.

>Of course, they have a separate file server machine to do all this magic.
>You'd have to modify the file system to make all the copy on write stuff work
>correctly.  However, perhaps you could simply make it a new file system type
>and run it off the file system switch stuff...

This sounds like an ideal application for SunOS 4.1's Translucent FileSystem
(TFS), where you can have a stack of filesystems mounted one over the other,
with the ones below showing through the holes in the ones above.  Just mount
your hard disk on top of your /1990/1015 system on top of your /1990/1014
system...  When you alter a file it is automatically copied to the top level.
I think you'd still have to have some mechanism to copy recently accessed
stuff up to the top level too.  Is anyone doing anything like this ?
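
The stacking idea can be mimicked at user level to show the lookup order.
The Python sketch below is not how TFS is implemented and uses none of
its interfaces; resolve and copy_up are hypothetical names, the layers
are plain directories listed top first, and deletions (which would need
some kind of marker in the top layer) are not modelled at all.

```python
import os
import shutil


def resolve(layers, relpath):
    """Return relpath as found in the first (topmost) layer that has it."""
    for layer in layers:
        candidate = os.path.join(layer, relpath)
        if os.path.exists(candidate):
            return candidate
    return None


def copy_up(layers, relpath):
    """Make sure relpath exists in the top layer, copying it up from below if needed.

    Calling this before a write gives copy-on-write; calling it on a read
    gives the "copy recently accessed stuff up" behaviour asked about above.
    """
    top = os.path.join(layers[0], relpath)
    if not os.path.exists(top):
        found = resolve(layers[1:], relpath)
        if found is None:
            raise FileNotFoundError(relpath)
        os.makedirs(os.path.dirname(top), exist_ok=True)
        shutil.copy2(found, top)
    return top
```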

Rob
--
UUCP:   ...!mcsun!ukc!warwick!cudcv	PHONE:  +44 203 523037
JANET:  cudcv@uk.ac.warwick             INET:   cudcv@warwick.ac.uk
Rob McMahon, Computing Services, Warwick University, Coventry CV4 7AL, England

hutch@fps.com (Jim Hutchison) (10/17/90)

In article <60058@bbn.BBN.COM> chowe@bbn.com (Carl Howe) writes:
>rodney@sun.ipl.rpi.edu (Rodney Peck II) writes:
>>[...deleted...]
>>Why not write some code to make a standard sunos system behave like Epoch?
>>[...deleted...]

>Plan 9 from Bell Labs does a similar sort of thing.  They back up the
>entire contents of their hard disk to optical every night as part of
>the standard file system tree.  For example, all files created or changed
>today would end up in the file system under /1990/1015/...  Yesterday's
>files would be under /1990/1014/....  The hard disk on any given day
>only contains the changes since the last optical backup.  You end up
>with both fast access to recently created data and on-line access
>to all backups.

O.k. I'll bite, how does such a system last?  At the end of each month does
it make a "level 0 dump" file system?   Certainly at some point the on-line
optical disk resources will be exhausted and some sort of consolidation will
be needed, or the system will have to grow continuously at a rate determined
by day-to-day disk activity.  It seems that consolidation could limit this
growth, and allow for old "level 0 dump" disks to be migrated onto a shelf
or into a safe place.

With the current speeds for optical drives, I'd kind of guess this system
is not useful as a primary storage device.  Presuming WORM and not M-O,
it couldn't be used for frequent migration, due to the rapid rate at which
the platters would fill up with minor revisions.
--
-
Jim Hutchison		{dcdwest,ucbvax}!ucsd!fps!hutch
Disclaimer:  I am not an official spokesman for FPS computing

davecb@yunexus.YorkU.CA (David Collier-Brown) (10/17/90)

rodney@sun.ipl.rpi.edu (Rodney Peck II) writes:
>Why not write some code to make a standard sunos system behave like Epoch?

hutch@fps.com (Jim Hutchison) writes:
>In article <60058@bbn.BBN.COM> chowe@bbn.com (Carl Howe) writes:
>Plan 9 from Bell Labs does a similar sort of thing.  They back up the
>entire contents of their hard disk to optical every night as part of
>the standard file system tree.

  Er, Multics had a requirement for stable storage too, back in the prehistory
of Unix...
   One's files were migrated automagically to a storage medium after
creation or change. Perhaps someone old enough could comment on this?

--dave
[One of the requirements of a timesharing service is that one can feel
confident in keeping one's **only** copy of data online: it is a system
requirement to checkpoint/journal/backup at fairly fine intervals to prevent
losses of significant work.  That was interpreted as ``a few hours at
worst'' in the Multics requirements spec]

-- 
David Collier-Brown,  | davecb@Nexus.YorkU.CA, ...!yunexus!davecb or
72 Abitibi Ave.,      | {toronto area...}lethe!dave or just
Willowdale, Ontario,  | postmaster@{nexus.}yorku.ca
CANADA. 416-223-8968  | work phone (416) 736-5257 x 22075

gdtltr@freezer.it.udel.edu (Gary Duzan) (10/17/90)

In article <11709@celit.fps.com> hutch@fps.com (Jim Hutchison) writes:
=>In article <60058@bbn.BBN.COM> chowe@bbn.com (Carl Howe) writes:
=>>rodney@sun.ipl.rpi.edu (Rodney Peck II) writes:
=>>>[...deleted...]
=>>>Why not write some code to make a standard sunos system behave like Epoch?
=>>>[...deleted...]
=>
=>>Plan 9 from Bell Labs does a similar sort of thing.  They back up the
=>>entire contents of their hard disk to optical every night as part of
=>>the standard file system tree.  [ ... ]
=>
=>O.k. I'll bite, how does such a system last?  At the end of each month does
=>it make a "level 0 dump" file system?   Certainly at some point the on-line
=>optical disk resources will be exhausted and some sort of consolidation will
=>be needed, or the system will have to grow continuously at a rate determined
=>by day-to-day disk activity.  It seems that consolidation could limit this
=>growth, and allow for old "level 0 dump" disks to be migrated onto a shelf
=>or into a safe place.
=>
=>With the current speeds for optical drives, I'd kind of guess this system
=>is not useful as a primary storage device.  Presuming WORM and not M-O,
=>it couldn't be used for frequent migration, due to the rapid rate at which
=>the platters would fill up with minor revisions.
=>--

   I'm no expert, but I did read some on it. Plan 9 uses a WORM Jukebox,
which can hold quite a bit of data. The Hard Disk can be viewed as a
cache for the Jukebox. That way you get speed, size, and reliability.

                                        Gary Duzan
                                        Time  Lord
                                    Third Regeneration



-- 
                          gdtltr@freezer.it.udel.edu
   _o_                    --------------------------                      _o_
 [|o o|]        An isolated computer is a terribly lonely thing.        [|o o|]
  |_O_|         "Don't listen to me; I never do." -- Doctor Who          |_O_|

craig@attcan.UUCP (Craig Campbell) (10/17/90)

In article <11709@celit.fps.com> hutch@fps.com (Jim Hutchison) writes:

>With the current speeds for optical drives, I'd kind of guess this system
>is not useful as a primary storage device.  Presuming WORM and not M-O,
>it couldn't be used for frequent migration, due to the rapid rate at which
>the platters would fill up with minor revisions.
>--
>-
>Jim Hutchison		{dcdwest,ucbvax}!ucsd!fps!hutch


There are now available Read Write optical drives.  These are not WORM
drives.  Somehow, the platter information is re-programmable.  Optical disks
now are useful as a primary storage medium.  I do not know $$ or details.
(Sorry :-))

craig

rodney@sun.ipl.rpi.edu (Rodney Peck II) (10/17/90)

In article <QAX%GY^@rpi.edu> I said:
>
>The easy way would be to do a find of the tree and make things that are
>really old into symbolic links to the new file system on the optical
>disk.  Could be a shell script?
>
>I'd like to hear what you think of this idea -- I'm planning to try it
>out once our jukebox arrives (we're getting the 10 platter 1 gig/platter
>version). 

Well, after extensive and tiring flipping through the trade magazines and
calling dealers, I found out some more information.  10 disk 600meg erasable
disks are very popular and are standardized.  The 1gig isn't and is only
made by maxstor.  It could be completely non-interchangeable inside of a
year.

Also, the 10 disk jukebox normally comes with no software to let it work
reasonably with the filesystem.  For example, consider what happens when
you have say, 6 platter sides mounted and a sync command comes along.
Remember that there is only one disk drive, 10 disks, each with two sides,
and it takes an average 6 seconds to mount a side.  Could be a big performance
problem there since update sends out sync commands all the time.  So,
most of the vendors don't have support software, just the jukeboxes.
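
A quick back-of-the-envelope check of the sync problem, in Python.  The
6 second swap time and the 6 mounted sides come from the paragraph
above; the 30 second sync interval is an assumption (the traditional
value for update, so check your own system), and the figure is an upper
bound that assumes every mounted side has to be loaded into the one
drive for each sync.

```python
def sync_swap_seconds(mounted_sides, swap_seconds=6.0):
    """Worst-case time one sync spends just loading sides into the single drive."""
    return mounted_sides * swap_seconds


def drive_busy_fraction(mounted_sides, sync_interval=30.0, swap_seconds=6.0):
    """Fraction of each sync interval eaten by swapping; above 1.0 it never keeps up."""
    return sync_swap_seconds(mounted_sides, swap_seconds) / sync_interval


# 6 sides at 6 seconds each is 36 seconds of swapping per sync, which is
# already longer than the assumed 30 second interval between syncs.
worst_case = sync_swap_seconds(6)
busy = drive_busy_fraction(6)
```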

A company that I would like your comments on sells a package that has
a jukebox and some caching software that handles all the filesystem stuff.
It takes up 30 or so meg on your machine to work as the cache and directory
space for the jukebox.  It keeps track of all the files in the library
including disks that aren't in the machine at that time.  So, you could have
any amount of data appear to be present on the filesystem.  Requests for
files on disks in someone's desk cause mount requests to the operator to 
be made.

The software allows for security type things to prevent users from reading
files that aren't theirs and all of that.

The system doesn't move older files off the main filesystem like epoch does,
however.

Basically, this seems like a decent package -- does anyone have any horror
stories to share about this company (R squared)?  The whole thing comes
to $17,000 including 10 disks (a $2000 value) and the software and they
fly out to install it.


-- 
Rodney

rodney@dali.ipl.rpi.edu (Rodney Peck II) (10/18/90)

In article <12778@vpk4.UUCP> craig@vpk4.ATT.COM (Craig Campbell) writes:
>In article <11709@celit.fps.com> hutch@fps.com (Jim Hutchison) writes:
>
>>With the current speeds for optical drives, I'd kind of guess this system
>>is not useful as a primary storage device.  Presuming WORM and not M-O,
>>it couldn't be used for frequent migration, due to the rapid rate at which
>>the platters would fill up with minor revisions.
>>--
>
>There are now available Read Write optical drives.  These are not WORM
>drives.  Somehow, the platter information is re-programmable.  Optical disks
>now are useful as a primary storage medium.  I do not know $$ or details.

I do... They are about $200 per 5.25" 600 meg platter.  A ten disk juke
box costs about $10,000 and is a scsi device.  The problem is lack of
software -- this 6 gigabyte thing acts like some sort of large dumb scsi
object without the proper software.

And now some engineering...
The way the read/write optical disk works is by taking advantage of two
nifty things.  First, some materials change color depending on the magnetic
field around them.  Second, the hysteresis curve of magnetic material
narrows when the material is heated.

The problem with high density conventional disks is that you need to get
a really tiny strong electromagnet really close to the disk to make a bit
small enough that it doesn't interfere with its neighbors, and at the
same time allows a lot of them on the disk.  This leads to lots and lots
of expensive engineering problems.

R/W optical gets around this in the following manner.  First, you take some
magnetic material and coat it with some stuff that changes its reflectivity
when in a magnetic field.  That's the "optical" disk.

Now, build a hard disk the usual way, but use a nice big head that is easier
to produce and is more stable than a high-tech floating head.  Install a
pointable laser that is powerful enough to heat a single spot on the disk
very very quickly.

Now, when you want to write a bit, you turn on the head with the proper
polarity to a level that is just below the amount needed to force the
cold media to the other side of the hysteresis curve.  Then, you zap the
tiny spot you want to flip with the laser.  Its threshold for flipping
drops because the curve narrows from the heat.  The data has been written.

To read the data, you shine a less powerful laser on the disk and measure
the reflected power.  1's will be one color, 0's the other.
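
The write trick can be captured in a toy Python model.  The numbers are
invented and the units arbitrary (a cold spot needs a field of 1.0 to
flip, a heated one 0.2, the head supplies 0.9), and a real hysteresis
curve is of course not a step function, so treat this purely as an
illustration of why only the heated spot flips.

```python
ROOM_C, HEATED_C = 25.0, 200.0  # made-up temperatures for the sketch


def coercivity(temp_c, cold=1.0, hot=0.2, threshold_c=150.0):
    """Field needed to flip a spot: high when cold, low once heated (toy step model)."""
    return hot if temp_c >= threshold_c else cold


def write_bit(track, index, value, head_field=0.9):
    """The big head biases every spot on the track; the laser heats only one."""
    for i in range(len(track)):
        temp = HEATED_C if i == index else ROOM_C
        if head_field > coercivity(temp):
            track[i] = value  # this spot's threshold is below the head field
    return track


def read_bit(track, index):
    """Reading is just looking at the spot (its reflectivity, in the real drive)."""
    return track[index]
```

With the default head field only the heated spot changes; push the field
above the cold threshold and the big head flips the whole track, which
is why the level has to sit just below it.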

And that's how these things work.  Nifty, eh?

I hope you found today's lecture informative.  Remember, there is an
F test this Friday at 8 am, and none of the TA's speak English and I'll
be out of town for two weeks so there are no office hours.  Thank you.

((you realize that RPI turns out very large numbers of computer scientists
who understand electrical engineering))
(((now's the part where everyone tells me how wrong I am.  brace yourself)))
-- 
Rodney

craig@attcan.UUCP (Craig Campbell) (10/18/90)

In article <N~1%AW*@rpi.edu> rodney@dali.ipl.rpi.edu (Rodney Peck II) writes:
>In article <12778@vpk4.UUCP> craig@vpk4.ATT.COM (Craig Campbell) writes:

>>There are now available Read Write optical drives.  These are not WORM
>>drives.  Somehow, the platter information is re-programmable.  Optical disks
>>now are useful as a primary storage medium.  I do not know $$ or details.
 
>I do... They are about $200 per 5.25" 600 meg platter.  A ten disk juke
>box costs about $10,000 and is a scsi device.  The problem is lack of
>software -- this 6 gigabyte thing acts like some sort of large dumb scsi
>object without the proper software.

The one we have is not a juke box but rather uses removable optical disks
one at a time.  The disk capacity is 300 meg/side (I think).

craig


P.S.  Yes, it is SCSI.

c.e.c.

msawyer@hokulea.hig.hawaii.edu (Michael Sawyer (REU)) (10/19/90)

In article <12795@vpk1.UUCP> craig@vpk1.ATT.COM (Craig Campbell) writes:
>In article <N~1%AW*@rpi.edu> rodney@dali.ipl.rpi.edu (Rodney Peck II) writes:
  [...]
>>I do... They are about $200 per 5.25" 600 meg platter.  A ten disk juke
>>box costs about $10,000 and is a scsi device.  The problem is lack of
>>software -- this 6 gigabyte thing acts like some sort of large dumb scsi
>>object without the proper software.
>
>The one we have is not a juke box but rather uses removable optical disks
>one at a time.  The disk capacity is 300 meg/side (I think).
>
We have one of the 10 disk jukebox systems on our Sun system, and have
had less than ideal luck with it.

For one thing, it uses a window based program to control the mounting
and dismounting of the disks loaded into the machine.  This means
that, as the software stands now, an operator MUST go to the console
and select which disk is being used.  There is no way to do this
automatically.  (What I would like to see is a method of having 20
mount points, one per side with 2 sides/disk.  The thing's robotic arm
actually flips the disk over when you use side B!  That way, I can write
a file to /jb1a to use disk 1, side a, then /jb6b to use disk 6, side b.
I realize that this would have to work in a manner similar to the
automounter, where the system sits there waiting for a request for one
of the disks, loads it, and mounts it at the appropriate point.  The
hardest part here is when two users ask for different disks.  I may
end up trying to write something to do this, and lock out other users
while anyone has access to the drive.  For what we require, this will
probably be fine...)
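
The bookkeeping for such a scheme might look like the Python sketch
below.  It is a model only: the Jukebox class is hypothetical, _load is
a stand-in for whatever would really drive the robot arm and mount the
side, and a single lock serializes all users, which is the simple
lock-everyone-else-out policy described above.

```python
import threading


class Jukebox:
    """One drive, ten two-sided disks, mount points named /jb1a ... /jb10b."""

    def __init__(self, disks=10):
        self.mount_points = {
            f"/jb{d}{s}": (d, s) for d in range(1, disks + 1) for s in "ab"
        }
        self.loaded = None             # (disk, side) currently in the drive
        self.swaps = 0                 # how many times the arm had to move
        self._lock = threading.Lock()  # one user of the drive at a time

    def _load(self, disk, side):
        # Placeholder: a real version would command the changer and mount the side.
        self.loaded = (disk, side)
        self.swaps += 1

    def access(self, path, action):
        """Load the side that `path` lives on if needed, then run action(disk_side)."""
        mount_point = "/" + path.lstrip("/").split("/", 1)[0]
        disk_side = self.mount_points[mount_point]  # KeyError for unknown mount points
        with self._lock:
            if self.loaded != disk_side:
                self._load(*disk_side)
            return action(disk_side)
```

Two users alternating between /jb1a and /jb6b would force a swap on
every access, so the swaps counter is a cheap way to see thrashing.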

Also, I don't know how good the quality control on these devices is at
present.  Our dept. has the jukebox we own as well as the standalone
one disk drive.  According to the people using the other system, they
had a number of problems getting it working, and ours had a defective
eject circuit (I had to take the D*** thing apart to get the disk
out!).  The company blamed it on their drivers, and wouldn't take it
to be repaired for almost a month!

When I did have the disk in the drive, I copied quite a few files onto
it, deleted them, and so forth.  The access time wasn't like a hard
drive by any means, but it wasn't unreasonable either.  (Sorry, I
didn't do any benchmarks.)

I am by no means an expert on this system or Unix (science comes
before system management), but I do have some idea what's going on...
Don't take what I say as the absolute truths.


---
return mail to: msawyer@io.soest.hawaii.edu
Michael Sawyer, Univ of Hawaii Physical Oceanography
(They don't even know I am using rn, so I sure don't speak for UH!) 

dick@cca.ucsf.edu (Dick Karpinski) (10/20/90)

In article <L$1%B%^@rpi.edu> rodney@sun.ipl.rpi.edu (Rodney Peck II) writes:
>In article <QAX%GY^@rpi.edu> I said:
>>... once our jukebox arrives (we're getting the 10 platter 1 gig/platter
>... 10 disk 600meg erasable disks are very popular and are standardized.
>... $17,000 including 10 disks (a $2000 value) ...

That works out to 6,000 megabytes for $17,000 or about $3/meg online
but you can do just as well with high capacity fixed disks and have
no 6 second delays etc.  The very largest WORM jukeboxes bring that
cost/meg down below $1, but the entry cost is >$200,000.  I can 
understand using removable platters, but the jukebox doesn't yet 
make sense to me.  Why would you want the extra hassles and delays
when it doesn't result in any cost savings?

On the other hand, a generic Epoch like filesystem makes a lot of 
sense to me, effectively replacing both backup and archiving with a
more capable and self-maintaining approach.  

I'm holding out for LaserTape using digital paper ($0.01/meg) in
3480 like 4"x4"x1" cartridges holding a few hundred feet of 1/2 inch
write once digital paper.  This is expected to appear in the next
year or so in the $20-40k price range with 50,000 megabytes/cartridge
and with a 10 cartridge loader.  This gives 1/2 a terabyte online
with similar delays to the jukebox.  It makes more sense to me.
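
The arithmetic behind these figures, spelled out in Python.  The prices
and capacities are the ones quoted in this thread; the per-meg figure
for LaserTape is only a guess that spreads the $20-40k drive price over
one full loader and ignores media cost, and decimal units are assumed
(1,000,000 megabytes to the terabyte).

```python
def dollars_per_meg(price_dollars, megabytes):
    """Cost per megabyte of online storage."""
    return price_dollars / megabytes


# Erasable optical jukebox: 10 disks of 600 meg for $17,000.
jukebox_megs = 10 * 600
jukebox_cost = dollars_per_meg(17_000, jukebox_megs)   # a bit under $3/meg

# LaserTape loader: 10 cartridges of 50,000 meg each.
lasertape_megs = 10 * 50_000
lasertape_terabytes = lasertape_megs / 1_000_000       # half a terabyte

# Drive price spread over one full loader, media cost not included.
lasertape_low = dollars_per_meg(20_000, lasertape_megs)
lasertape_high = dollars_per_meg(40_000, lasertape_megs)
```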

Dick