[net.database] A variant of the streams idea

david@ukma.UUCP (David Herron, NPR Lover) (12/16/85)

I just finished reading dmr's streams article from the BSTJ of last year,
and an idea has occurred to me (actually re-occurred).

I originally had this idea when reading about sockets...


Could one use a variant of this to provide "funny" kinds of files?


What I mean is, ISAM is merely a protocol for using a file, right?
So to use a database "properly", open the file and push a <something>
onto the file which does the indexing/etc.  And you would still
be able to use the file in a "raw" mode for low-level patching, etc.
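
Something like this, say.  (The "isam" module name is invented,
and I'm borrowing an I_PUSH-style ioctl; the actual pushing
mechanism would depend on the system.)

    /* sketch only: open a data file and push an (invented)
     * "isam" module onto it */
    #include <fcntl.h>
    #include <stropts.h>    /* I_PUSH, where streams exist */

    int
    open_isam(path)
    char *path;
    {
        int fd = open(path, O_RDWR);

        if (fd < 0)
            return -1;
        /* the pushed module now interprets reads and writes as
         * keyed record operations; popping it (I_POP) gets the
         * "raw" file back for low-level patching */
        if (ioctl(fd, I_PUSH, "isam") < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }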


Is everybody confused now?
-- 
David Herron,  cbosgd!ukma!david, david@UKMA.BITNET.

Experience is something you don't get until just after you need it.

greg@ncr-sd.UUCP (Greg Noel) (12/20/85)

In article <2416@ukma.UUCP> david@ukma.UUCP (David Herron, NPR Lover) writes:
>I just finished reading dmr's streams article from the BSTJ of last year,
> ....
>Could one use a variant of this to provide "funny" kinds of files?
>
>What I mean is, ISAM is merely a protocol for using a file, right?
>So to use a database "properly", open the file and push a <something>
>onto the file which does the indexing/etc.  And you would still
>be able to use the file in a "raw" mode for low-level patching, etc.

I spoke to DMR after he presented his paper and suggested something
similar.  In fact, if you look at it closely, a buffered filesystem is
just a specialized interface pushed on top of the raw filesystem.  I
would like to see the whole stream mechanism generalized so that I could
push a stream module in front of \any/ file, not just a tty file.  This
seems to be a simplification and unification of the concept of file
accessing, and Unix is renowned for the simplification and unification
of concepts, so why not?
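
(A user-level caricature of what I mean, with all the names mine:
think of the stdio buffering layer as exactly such a pushed module.)

    /* a buffering "module" interposed between the user and a
     * raw file descriptor */
    struct bufmod {
        int     fd;             /* the raw file underneath */
        char    buf[1024];
        int     cnt;            /* bytes left in buf */
        char    *ptr;           /* next byte to hand out */
    };

    int
    bufread(m, c)               /* one byte at a time, like getc */
    struct bufmod *m;
    char *c;
    {
        if (m->cnt <= 0) {      /* refill from the raw file */
            m->cnt = read(m->fd, m->buf, sizeof m->buf);
            if (m->cnt <= 0)
                return m->cnt;  /* 0 on EOF, -1 on error */
            m->ptr = m->buf;
        }
        m->cnt--;
        *c = *m->ptr++;
        return 1;
    }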
-- 
-- Greg Noel, NCR Rancho Bernardo    Greg@ncr-sd.UUCP or Greg@nosc.ARPA

gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) (12/22/85)

> I would like to see the whole stream mechanism generalized so that I
> could push a stream module in front of \any/ file, not just a tty file.

Several of us have discussed this idea before.
It comes down to the fact that there really are
significant differences between random-access
(disk) files and sequential (communication)
files.  To force random files into the stream
model would require sacrificing some of their
desirable properties (seekability, sharability,
speed), alas.

Nice try, though.

david@ukma.UUCP (David Herron, NPR Lover) (12/24/85)

In article <964@brl-tgr.ARPA> gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) writes:
>> I would like to see the whole stream mechanism generalized so that I
>> could push a stream module in front of \any/ file, not just a tty file.
>
>Several of us have discussed this idea before.
> ...
>     To force random files into the stream
>model would require sacrificing some of their
>desirable properties (seekability, sharability,
>speed), alas.
>
>Nice try, though.

Why do you lose "seekability"?  Pushing a protocol on top of a file
potentially brings in a whole set of ioctl()'s that can be performed.
So lseek() becomes ioctl(fd, FLSEEK, offset) or some such.  Other
operations can be performed with similar ioctl()'s.
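
Spelled out (FLSEEK is a code I just made up, not anything that
exists):

    /* hypothetical: a pushed module honors an FLSEEK ioctl by
     * repositioning the underlying disk file */
    #define FLSEEK  (('f'<<8)|1)        /* invented ioctl code */

    long
    stream_lseek(fd, offset)
    int fd;
    long offset;
    {
        if (ioctl(fd, FLSEEK, &offset) < 0)
            return -1L;
        return offset;
    }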

Right?

Or am I missing something?

-- 
David Herron,  cbosgd!ukma!david, david@UKMA.BITNET.

Experience is something you don't get until just after you need it.

gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) (12/26/85)

> Why do you lose "seekability"?  Pushing a protocol on top of a file
> potentially brings in a whole set of ioctl()'s that can be performed.
> So lseek() becomes ioctl(fd, FLSEEK, offset) or some such.  Other
> operations can be performed with similar ioctl()'s.

By the time you have made open() slip the necessary protocol
modules onto disk files, etc., you end up with just another,
kludgier, implementation of the UNIX file system.

Streams work as nicely as they do because they are modeled
as full-duplex pipes with protocol "filters" inserted into
the pipeline.  Stream data comes from somewhere and goes
somewhere.  Disk files just sit; there is no inherent "flow".
File data is not normally "consumed", for example.  Now,
FIFOs and pipes are a different matter, and indeed pipes are
implemented using streams on 8th Edition UNIX.

One could certainly add features to ordinary files.  I once
thought up one, which Mike Karels told me had already been
invented and called "portals".  This would be code associated
with a file that would "fire up" when the file was accessed;
not too different from the idea of attaching protocol modules
to files.  Something like this might be worth doing,
especially for databasing, but not by stretching the stream
I/O system beyond its basic model.

greg@ncr-sd.UUCP (Greg Noel) (12/26/85)

In article <964@brl-tgr.ARPA> gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>)
claims that there are
>significant differences between random-access
>(disk) files and sequential (communication)
>files.  To force random files into the stream
>model would require sacrificing some of their
>desirable properties (seekability, sharability,
>speed), alas.

I'm prepared to believe that the current implementation of streams makes
this difficult or impossible, but I'm not prepared to believe that the
idea itself is difficult or impossible.  After all, what is filesystem
buffering except a protocol pushed on top of raw filesystems?  Perhaps
we should take this conversation off-line while you try to convince me,
since it doesn't seem to be something of general interest.
-- 
-- Greg Noel, NCR Rancho Bernardo    Greg@ncr-sd.UUCP or Greg@nosc.ARPA

ark@ut-sally.UUCP (Arthur M. Keller) (12/27/85)

In article <376@ncr-sd.UUCP> greg@ncr-sd.UUCP (Greg Noel) writes:
>In article <964@brl-tgr.ARPA> gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>)
>claims that there are
>>significant differences between random-access
>>(disk) files and sequential (communication)
>>files.  To force random files into the stream
>>model would require sacrificing some of their
>>desirable properties (seekability, sharability,
>>speed), alas.
>
>I'm prepared to believe that the current implementation of streams makes
>this difficult or impossible, but I'm not prepared to believe that the
>idea itself is difficult or impossible.

Some people consider random access to be "stream access with
repositioning".  That might work well in a single-user environment but
fails badly in a multi-user environment with concurrency control.  In
a file system or a database system, the granularity of locking is an
important concept.  It corresponds to the granularity of sharing.  It
may but need not correspond to the granularity of access, although the
two must be comparable (equal or one contained in the other for all
practical purposes).

In streams, the granularity of access is not a very well defined
concept.  You don't really want to lock the "next 100 bytes I will
next read".  Rather, you probably want to lock the "next group of
related fields" commonly called a record.

The notion of streams generally indicates one source and one consumer
for each dataflow pathway, although there may or may not be a
separate, often implicit reverse pathway.  The notion can be
generalized to multiple producers and consumers, but still obeying a
FIFO (or priority queue) discipline.  No notion of modification of the
data in a stream exists, although a transformer may be interposed that
consumes a stream and produces another distinct stream by a
transformation of the consumed stream.  In a randomly accessed, shared
file system, the data do not follow anything resembling a FIFO
discipline.  The data are not consumed, they are referenced; they may
also be created, updated, and destroyed.  To consider a shared file to
be a stream actually means that the file is encapsulated in a process
and your stream is communicating with that process.  Then the process
uses a more traditional discipline to interact with the file.  This
would involve a protocol transformation and the attendant overhead.
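
In outline, "the file encapsulated in a process" looks like this;
the message format is invented for the example, and the translation
from stream messages to random access is exactly the overhead I
mean:

    #define R_READ  0
    #define R_WRITE 1

    struct req {                /* invented request message */
        int     op;             /* R_READ or R_WRITE */
        long    offset;         /* where in the file */
        int     count;          /* how many bytes (<= 1024 here) */
    };

    serve(stream, filefd)       /* the encapsulating process */
    int stream, filefd;
    {
        struct req r;
        char buf[1024];
        int n;

        while (read(stream, (char *)&r, sizeof r) == sizeof r) {
            lseek(filefd, r.offset, 0);         /* random access */
            if (r.op == R_READ) {
                n = read(filefd, buf, r.count);
                write(stream, buf, n);          /* reply flows back */
            } else {
                read(stream, buf, r.count);
                write(filefd, buf, r.count);
            }
        }
    }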

I've only briefly touched on some of the issues, but I hope that this
can give the readership of this newsgroup a feeling for the problems
involved.

Prof. Arthur M. Keller
The University of Texas at Austin
Department of Computer Sciences

-- 
------------------------------------------------------------------------------
Arpanet: ARK@SALLY.UTEXAS.EDU
UUCP:    {gatech,harvard,ihnp4,pyramid,seismo}!ut-sally!ark

mishkin@apollo.uucp (Nathaniel Mishkin) (12/27/85)

At Apollo, we've developed a system for extending the concept of "stream".
Basically, every "object" (read "file" if you're not familiar with the
Smalltalk/object-oriented view of the world) has a type and every type
has a "manager" (read "subroutine library") that contains one entry point
(procedure) for each "operation" (read "generic procedure") that the
type supports.

In the case of stream I/O (the typed object view of the world extends
beyond simple stream I/O), the operations are things like "read", "write",
"seek", etc.  Operations are grouped into "traits" (read "interfaces").
The stream I/O facility defines a set of traits including "IO" (contains
the basic I/O operations listed above), "Socket" (contains the 4.2bsd
socket operations like "bind", "connect", "listen", etc.), and "Pad"
(contains operations for manipulating windows).
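
In C terms (an analogy only, not our actual declarations), a trait
is a table of entry points, and a type's manager supplies one table
for each trait the type supports:

    struct io_trait {           /* the "IO" trait, roughly */
        int     (*t_read)();    /* read(obj, buf, count) */
        int     (*t_write)();   /* write(obj, buf, count) */
        long    (*t_seek)();    /* seek(obj, offset, whence) */
    };

    /* a "dbm"-typed object is bound to its manager's table */
    extern int dbm_read(), dbm_write();
    extern long dbm_seek();

    struct io_trait dbm_io = { dbm_read, dbm_write, dbm_seek };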

Users can define new types, write their associated managers, install
them into the system (without having to touch existing system source
code), create objects of the new types and (lo and behold) have existing
programs that use the stream I/O interface (i.e. that call "read", "write",
etc.) work on the new objects.  Type IDs are 64 bits and unique, so there's
never a problem when moving objects from one system to another (as long
as you bring the managers with you).  Managers run in user state and
are dynamically loaded when needed.

As you might guess, there are lots of uses for this facility.  For example,
in the case of a DBMS, one simple facility you might want is to be able
to read the DBMS like a sequential ASCII file, independent of what its
real internal structure is.  This might not be appropriate or reasonable
for every database, but to take a simple example, today, you need a special
program in order to dump the contents of a "dbm(3X)" database file.  On
an Apollo, such files could be typed as being "dbm" files.  Then, you
could write a manager for the "dbm" type that implemented the IO trait
operations for that type.  (You can think of the manager as being a
different form of the special dump program.) After you did this, you'd
be able to run programs like "grep" on the "dbm" files and get useful
results.
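
Underneath, the IO-trait manager for "dbm" files would just walk
the database with the dbm(3X) calls themselves; printing here
stands in for filling the caller's read buffer:

    #include <stdio.h>
    #include <dbm.h>    /* datum, dbminit(), firstkey(), ... */

    dump(file)          /* roughly what the manager's "read" does */
    char *file;
    {
        datum key, val;

        if (dbminit(file) < 0)
            return;
        for (key = firstkey(); key.dptr != 0; key = nextkey(key)) {
            val = fetch(key);
            printf("%.*s\t%.*s\n", key.dsize, key.dptr,
                val.dsize, val.dptr);
        }
    }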

If you were really ambitious, you might want to define a new trait --
call it the "ISAM" trait -- that had operations like "seek_by_key" that
took logical keys (i.e. NOT byte offsets) as arguments.  (Currently,
only Apollo -- not users -- can define new TRAITS, but we hope to have
this fixed sometime.) The idea is that this trait could be supported
by different DBMS's, that you'd be able to write programs that used those
operations, and that those programs would work with different DBMSs.
Of course it's not clear that you could come up with a set of operations
that made sense to enough different DBMS's to be worthwhile.  (Consider
the question of what the term "key" means to different DBMS's.)  But
it'd be interesting to investigate.
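
The trait itself might be no more than a handful of entry points
(all of these names are hypothetical):

    struct isam_trait {
        int     (*seek_by_key)();   /* seek_by_key(obj, key, keylen) */
        int     (*read_record)();   /* read_record(obj, buf, buflen) */
        int     (*next_record)();   /* advance in key order */
    };

Each DBMS's manager would supply its own table, and a program coded
against these operations wouldn't care which one it got.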

ark@ut-sally.UUCP (Arthur M. Keller) (12/29/85)

In article <2afa6c05.3166@apollo.uucp> mishkin@apollo.UUCP (Nathaniel Mishkin) writes:
>At Apollo, we've developed a system for extending the concept of "stream".

I would argue that what you have really done is implement the concept
of streams using the concept of objects.  Since the concept of objects
is at least as powerful as arbitrary procedure calls, this is not too
surprising.

>As you might guess, there are lots of uses for this facility.  For example,
>in the case of a DBMS, one simple facility you might want is to be able
>to read the DBMS like a sequential ASCII file, independent of what its
>real internal structure is.

This is the concept of information hiding.  You are using the features
common to streams and databases, so it is not surprising you can fit
a streams-like interface on top of a database.  It's also possible that
the streams-like interface was chosen to be a subset of the database-like
interface, but that's not necessary.

>If you were really ambitious, you might want to define a new trait --
>call it the "ISAM" trait -- that had operations like "seek_by_key" that
>took logical keys (i.e. NOT byte offsets) as arguments.

If you do decide to do that, I'd suggest supporting any number of keys,
not just one, and not unnecessarily binding the concept of a unique index
to the concept of a clustered index, but that's a whole 'nother
pet peeve of mine.

>Of course it's not clear that you could come up with a set of operations
>that made sense to enough different DBMS's to be worthwhile.  (Consider
>the question of what the term "key" means to different DBMS's.)  But
>it'd be interesting to investigate.

It should be possible to do it with any two databases designed using
the same model (relational, hierarchical, network, entity-relationship,
functional, or what-have-you).  Furthermore, a sufficiently powerful
network-model database implementation could have a relational-model
interface on it, which to the user would be indistinguishable from a
relational-model database written using a traditional relational
implementation style.

Arthur M. Keller

-- 
------------------------------------------------------------------------------
Arpanet: ARK@SALLY.UTEXAS.EDU
UUCP:    {gatech,harvard,ihnp4,pyramid,seismo}!ut-sally!ark

jack@boring.UUCP (Jack Jansen) (01/02/86)

Doug Gwyn states that files are completely different
from pipes/FIFOs/etc., in that a disk file doesn't have
a data flow the way the others do.

I agree on this, but there's a way to make a file look like
a stream: just say that you're not talking to a *file*, but
to a *file server*. This way, you get your stream model
back. Now it's easy to insert modules that do ASCII-EBCDIC
conversion, sparse file handling, database lookups, even
readahead/writebehind, without modifying the basic low-level
file server.
There are great advantages to the file-server model:
- You don't pay for features you didn't ask for (ever heard
database people raving about unix readahead?)
- It's easier to maintain, since it consists of more, but smaller,
modules.
- Remote filesystems come for (almost) free.
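
For instance, an ASCII-EBCDIC module needn't know anything about
files at all; it just sits in the path and transforms bytes in
transit (the translation table is assumed):

    extern char to_ascii[256];  /* EBCDIC -> ASCII table, assumed */

    filter(in, out)             /* from file server, to client */
    int in, out;
    {
        char buf[1024];
        int n, i;

        while ((n = read(in, buf, sizeof buf)) > 0) {
            for (i = 0; i < n; i++)
                buf[i] = to_ascii[buf[i] & 0377];
            write(out, buf, n);
        }
    }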

This is, by the way, the approach used in the Amoeba distributed
operating system, and in some other message-passing operating
systems. Hmm. Time to move to net.os?
-- 
	Jack Jansen, jack@mcvax.UUCP
	The shell is my oyster.

larry@ingres.ARPA (Larry Rowe) (01/06/86)

In article <6717@boring.UUCP> jack@mcvax.UUCP (Jack Jansen) writes:
>
>There are great advantages to the file-server model:
>- You don't pay for features you didn't ask for (ever heard
>database people raving about unix readahead?)

In single-user benchmarks, read-ahead is a definite win for queries
that scan a lot of pages.  The reason should be pretty obvious:
read-ahead overlaps CPU and I/O processing.  A simple DBMS will run
very nicely with UNIX-style read-ahead, but a sophisticated DBMS
will eventually have to replace the general operating system
read-ahead with a smarter one.  The reason has to do with which page
gets read on read-ahead.  Most DBMS's impose a page structure on the
data file that includes a forward pointer to the next primary page
and a pointer to the overflow pages for the current page.  When
doing read-ahead, you want to scan the pages in the order: primary
page, overflow page, overflow page, ..., primary page, overflow
page, etc.  General UNIX read-ahead reads the next logically
sequential page, which won't give this ordering.
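
In outline, the dbms's own read-ahead just follows the chain it
maintains; the page header below is made up, but typical:

    #define PGSIZE  2048

    struct page {
        long    pg_next;        /* next primary page, or 0 */
        long    pg_ovfl;        /* first (or next) overflow page, or 0 */
        char    pg_data[PGSIZE - 2 * sizeof(long)];
    };

    readpage(fd, pgno, p)
    int fd;
    long pgno;
    struct page *p;
    {
        lseek(fd, pgno * PGSIZE, 0);
        return read(fd, (char *)p, PGSIZE);
    }

    scan(fd)            /* visit pages in the order described above */
    int fd;
    {
        struct page pg, ov;
        long primary, pgno;

        /* assume the first primary page is page 1 */
        for (primary = 1; primary != 0; primary = pg.pg_next) {
            readpage(fd, primary, &pg);
            /* this page's overflow chain first, then the next primary */
            for (pgno = pg.pg_ovfl; pgno != 0; pgno = ov.pg_ovfl)
                readpage(fd, pgno, &ov);
        }
    }
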
	larry