[comp.unix.wizards] A Shared Libraries Solution

craig@unicus.UUCP (Craig D. Hubley) (10/07/87)

One effective way to deal with revisions to shared libraries is to keep
several versions around, and have PROGRAMS know which revision levels they
can count on to work.

`Services', which are programs such as print or mail services, but could
just as easily be shared libraries (which should almost never be compiled
into code) have a revision level, such as 10.2, where the 10 is a major
revision level, and the .2 signifies changes that are not known to cause
ANY program to break.  Each `service' or library knows what revision levels
it has available or can emulate.
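
As a rough illustration of the major.minor numbering just described, here is a
minimal Python sketch. The names Revision and Service, and the list of
revisions, are made up for the example and are not taken from XNS or any
real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Revision:
    """A revision level such as 10.2: major 10, minor 2."""
    major: int
    minor: int

    @classmethod
    def parse(cls, text):
        # "10.2" -> Revision(10, 2). Fields are compared as numbers,
        # so 10.10 sorts after 10.2.
        major, minor = text.split(".")
        return cls(int(major), int(minor))


class Service:
    """Hypothetical service that knows which revisions it can supply or emulate."""

    def __init__(self, name, revisions):
        self.name = name
        self.revisions = sorted(Revision.parse(r) for r in revisions)

    def can_serve(self, wanted):
        # True only if this exact revision is available or emulated.
        return wanted in self.revisions

    def newest(self):
        return self.revisions[-1]


printer = Service("print", ["8.0", "10.0", "10.2"])
print(printer.can_serve(Revision.parse("10.0")))
print(printer.newest())
```

Keeping major and minor as separate integers avoids the trap of comparing
revision levels as strings or floats.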

If a program breaks inside or `on the border' of a library routine,
it stores that fact, and the revision level of the library it was using.
Thereafter, it will ask for `Print Service 8.0 - 10.0', and if only 10.2
is available, it will fail with a robust error, perhaps searching elsewhere
on the system for another print server or archived library of old service
routines.  In fact, most services can deal quite handily with such problems,
simply by having backup disk storage that contains the older services, if
a particular site has programs that need them.  If you don't need the older
services, then don't store them.  The worst that will happen is that your
program will try the new one, fail, back up to the error (if possible),
or restart if not, and ask you to make the old one available.
On a microcomputer, this might mean inserting a floppy.  Quite a bit friendlier
than weird data errors, hein?  
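
The bookkeeping described above might look something like the following
Python sketch. It assumes revisions are (major, minor) pairs and that the
program keeps a set of revisions it has seen fail; choose_revision and
NoUsableRevision are invented names, and how a failure is detected is left
out entirely.

```python
class NoUsableRevision(Exception):
    """Raised rather than running against a revision already seen to fail."""


def choose_revision(available, known_bad):
    """Return the newest available (major, minor) pair not recorded as bad."""
    candidates = [rev for rev in available if rev not in known_bad]
    if not candidates:
        raise NoUsableRevision(
            "no usable revision installed; please make an older one available"
        )
    return max(candidates)


available = [(8, 0), (10, 0), (10, 2)]
known_bad = set()            # would be stored with the program between runs

first = choose_revision(available, known_bad)     # newest, assumed to work
known_bad.add(first)                              # the program broke; remember it
second = choose_revision(available, known_bad)    # falls back to an older one
print(first, second)

try:
    # Only the failed revision is installed: report it clearly, don't run.
    choose_revision([(10, 2)], known_bad)
except NoUsableRevision as err:
    print("robust error:", err)
```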

XNS uses a similar system for services to know whether or not they can 
serve various programs, though I don't know if the program-based revision
tracking is automatic.  It might be by now.

This method is effective because:

	It frees you, unlike "check the interface" from having to debug
	the whole system before getting an actual solution.

	The revision-tracking is automatic.

	Programs assume new services will work until they actually fail.

	Programs can find problems and log them, notifying the user,
	or users can find problems.  In either case, the `buggy' revision
	will no longer be used by that program.  Or at least, that copy
	of that program.  An alternative would be to have the library store
	the failed-program data, but that would impose a burden.

	Unlike "Revision X.Y or greater", such as the Amiga uses, it does
	not assume that upgrades are always robust.  As anyone involved in
	large systems design should know, the NUMBER of bugs remains constant
	above a certain size... they only move around.

	It is being effectively employed, at least partially, in XNS, and 
	I believe that a similar, though less straightforward, system is
	used in IBM mainframes.

Some disadvantages:

	Programs would have to store failed-version information on every
	shared library they use.  This is fairly minimal in terms of size,
	but restarts, and retries, could use up a fair bit of computing time,
	where libraries change often, or many copies of a program exist.

	The automatic-logging aspect of the system would be subject to bugs.

	Users could become `spoiled' enough to count on the system to find
	incompatibilities, and fail to look for data errors themselves.

	Shared libraries would have to be checked, on open, for compatibility.

Considering some of these are problems already extant in the existing
bug-spotting procedures, and the worst thing that gets added is a little
extra data and a few more cycles to open libraries, it seems pro overall.

Any comments, particularly from those who have used distributed services
under such a system?

Perhaps more importantly from a UNIX point of view, could it be effectively
implemented on today's systems?

This has been an interesting debate.  Keep it up.

	Craig Hubley, Unicus Corporation, Toronto, Ont.
	craig@Unicus.COM				(Internet)
	{uunet!mnetor, utzoo!utcsri}!unicus!craig	(dumb uucp)
	mnetor!unicus!craig@uunet.uu.net		(dumb arpa)

steve@nuchat.UUCP (Steve Nuchia) (10/16/87)

In article <1057@unicus.UUCP>, craig@unicus.UUCP (Craig D. Hubley) writes:
> One effective way to deal with revisions to shared libraries is to keep
> several versions around, and have PROGRAMS know which revision levels they
> can count on to work.

With you so far...

> If a program breaks inside or `on the border' of a library routine,
> it stores that fact, and the revision level of the library it was using.
> Thereafter, it will ask for `Print Service 8.0 - 10.0', and if only 10.2
> is available, it will fail with a robust error, perhaps searching elsewhere
> on the system for another print server or archived library of old service
> routines.  In fact, most services can deal quite handily with such problems,
> simply by having backup disk storage that contains the older services, if
> a particular site has programs that need them.  If you don't need the older
> services, then don't store them.  The worst that will happen is that your

Still with you...

> program will try the new one, fail, back up to the error (if possible),
> or restart if not, and ask you to make the old one available.

Does this not beg the question of how the program _detects_ the failure?

> On a microcomputer, this might mean inserting a floppy.  Quite a bit friendlier
> than weird data errors, hein?  

Jah, if it works.

> XNS uses a similar system for services to know whether or not they can 
> serve various programs, though I don't know if the program-based revision
> tracking is automatic.  It might be by now.

Scarier and scarier.

> This method is effective because:
> 	It frees you, unlike "check the interface" from having to debug
> 	the whole system before getting an actual solution.

I think I understand you to be saying that your approach allows the system
to run in the presence of a new, untested library?  How does this differ
(in the light of the sequel) from the "old way" ?

> 	The revision-tracking is automatic.

True, under the assumptions.  Is this a Good Thing?

> 	Programs assume new services will work until they actually fail.

This is the heart of the matter.  Your proposal is for an optimistic
policy, whereas the traditional approach is pessimistic.  In the pessimistic
approach a program asks for the library(s) it has been tested with, and
someone has to update its idea of what libraries are good manually.  In
your optimistic approach a program would use the latest available library
that had not been _found_to_be_buggy_ (in a relative sense).
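
To make the contrast concrete, here is a small Python sketch of the two
policies. It assumes revisions are (major, minor) pairs; the function names
and the sets tested_good and found_buggy are hypothetical.

```python
def pessimistic_choice(available, tested_good):
    """Use only revisions the program has actually been tested with."""
    usable = [rev for rev in available if rev in tested_good]
    return max(usable, default=None)


def optimistic_choice(available, found_buggy):
    """Use the newest revision that has not yet been found to be buggy."""
    usable = [rev for rev in available if rev not in found_buggy]
    return max(usable, default=None)


available = [(9, 4), (10, 0), (10, 2)]

# Pessimistic: someone must add (10, 0) by hand before it is ever used.
print(pessimistic_choice(available, tested_good={(9, 4)}))

# Optimistic: (10, 2) was used until it failed; now (10, 0) is next in line.
print(optimistic_choice(available, found_buggy={(10, 2)}))
```

The only difference is which list is maintained: a whitelist updated by hand,
or a blacklist that grows as failures are noticed.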

> 	Programs can find problems and log them, notifying the user,
> 	or users can find problems.  In either case, the `buggy' revision
> 	will no longer be used by that program.  Or at least, that copy
> 	of that program.  An alternative would be to have the library store
> 	the failed-program data, but that would impose a burden.

Exactly how are programs to do this?  Is this not a close relative to
the halting problem?  I've heard that the 5ESS control program was
over 75% "audit" code - keeping an eye on the other 25%.  This seems
like an extreme penalty (if my understanding is correct) for not
proving the operative 25%, and illustrates the practical difficulty
of software self-test.

> 	Unlike "Revision X.Y or greater", such as the Amiga uses, it does
> 	not assume that upgrades are always robust.  As anyone involved in
> 	large systems design should know, the NUMBER of bugs remains constant
> 	above a certain size... they only move around.

Agreed, the x.y or greater approach is even more optimistic than yours,
since it makes no explicit provision for buggy (just "old") libraries.

> 	It is being effectively employed, at least partially, in XNS, and 
> 	I believe that a similar, though less straightforward, system is
> 	used in IBM mainframes.

Perhaps I misunderstand you.  Do these operational systems employ human
intervention in the error detection loop?

> Some disadvantages:
> 	Programs would have to store failed-version information on every
> 	shared library they use.  This is fairly minimal in terms of size,
> 	but restarts, and retries, could use up a fair bit of computing time,
> 	where libraries change often, or many copies of a program exist.

Looks like a proper analysis.

> 	The automatic-logging aspect of the system would be subject to bugs.

True, but such things can be managed easily - any _specific_ library
service can be made robust; the difficulty that brings us to this
discussion lies in making a large, diverse, and ever-changing
collection of services robust in the aggregate.

> 	Users could become `spoiled' enough to count on the system to find
> 	incompatibilities, and fail to look for data errors themselves.

Naive users are a problem in many areas, password security being one of the
most well known, with inadequate failure reporting running a close second.

> 	Shared libraries would have to be checked, on open, for compatibility.

If by this you mean comparing them against the stored list of compatibilities,
I had understood this to be a part of the overhead of that scheme.  Do you
have something else in mind?  Perhaps you allude to the "testing" of the
library on first encounter?

> Considering some of these are problems already extant in the existing
> bug-spotting procedures, and the worst thing that gets added is a little
> extra data and a few more cycles to open libraries, it seems pro overall.

Actually, assuming I properly understand you, the user complacency is
probably the worst that gets added.  Especially if this extends to
the software engineering folks, who _should_ be testing things and
not relying on a mathematically unsound (isomorphic with the halting
problem) problem detection and logging scheme.

> Any comments, particularly from those who have used distributed services
> under such a system?

I think the system you advocate, call it "optimistic but reactionary",
is a useful addition to the family of library sharing algorithms.  It
should not be expected to work miracles, and indeed should be seen as
a way of integrating _user_ problem reporting into the library upgrade
cycle rather than eliminating human testing.

> This has been an interesting debate.  Keep it up.
I concur.
-- 
Steve Nuchia	    | [...] but the machine would probably be allowed no mercy.
uunet!nuchat!steve  | In other words then, if a machine is expected to be
(713) 334 6720	    | infallible, it cannot be intelligent.  - Alan Turing, 1947

daveb@geac.UUCP (10/18/87)

In article <400@nuchat.UUCP> steve@nuchat.UUCP (Steve Nuchia) writes:
>In article <1057@unicus.UUCP>, craig@unicus.UUCP (Craig D. Hubley) writes:
>> program will try the new one, fail, back up to the error (if possible),
>> or restart if not, and ask you to make the old one available.
>
>Does this not beg the question of how the program _detects_ the failure?
> ...
>I think I understand you to be saying that your approach allows the system
>to run in the presence of a new, untested library?  How does this differ
>(in the light of the sequel) from the "old way" ?

  One technique actually used was to have a human detect certain
errors by running with "EXperimental_Library" in his search path
before the tested libraries.  If a program-detectable error (a
mismatch, in practice) occurred, she got an error message.  If she
detected an error herself or received a message, she contacted the
author of the library or the system administrator (since that was
easy) before setting up a special referencing domain for the program
(which was not so easy, but at least possible).
  Humans would actually use >exl in their search paths, to be sure
of getting the newest versions of things.  Even I did. 

  There were other support facilities underneath the human tester,
obviously.  The most important was a translate-to-different-version
routine that the author of the new, improved library was required to
write for each incompatible data-structure or file-format change.
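
  A translate-to-different-version routine of this general kind can be
sketched in Python as below. This is only an illustration of the idea, not
how it was actually done on the system being described: the registry, the
integer format versions, and the record layout are all invented.

```python
# Registry of translation routines, keyed by (old_version, new_version).
translators = {}


def translator(old, new):
    """Decorator: register a routine that converts data from old to new."""
    def register(func):
        translators[(old, new)] = func
        return func
    return register


@translator(1, 2)
def v1_to_v2(record):
    # Hypothetical incompatible change: version 2 splits "name" in two.
    first, _, last = record["name"].partition(" ")
    return {"first": first, "last": last}


def upgrade(record, old, new):
    """Apply the registered routines one version step at a time."""
    while old < new:
        record = translators[(old, old + 1)](record)
        old += 1
    return record


print(upgrade({"name": "Ada Lovelace"}, 1, 2))
```

  Requiring one routine per incompatible change means old data can always
be walked forward, one step at a time, to whatever version is current.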

 --dave (I was discussing Multics, you understand) c-b
-- 
 David Collier-Brown.                 {mnetor|yetti|utgpu}!geac!daveb
 Geac Computers International Inc.,   |  Computer Science loses its
 350 Steelcase Road,Markham, Ontario, |  memory (if not its mind)
 CANADA, L3R 1B3 (416) 475-0525 x3279 |  every 6 months.

craig@unicus.UUCP (10/19/87)

Steve Nuchia writes:
>In article <1057@unicus.UUCP>, craig@unicus.UUCP (Craig D. Hubley) writes:
>> One effective way to deal with revisions to shared libraries is to keep
>> several versions around, and have PROGRAMS know which revision levels they
>> can count on to work.
>> ...
>> If a program breaks inside or `on the border' of a library routine,
>> ...
>> program will try the new one, fail, back up to the error (if possible),
>> or restart if not, and ask you to make the old one available.
>
>Does this not beg the question of how the program _detects_ the failure?

Anything more complex than a simple error-message generating failure would not
be reliably detected.  That's why humans must stay in the loop.  The detection
is simply another process to run when application-level errors occur.  This is
just another error vector that essentially adds the information `version 9.4
doesn't work'.  Users could observe errors and generate this information from
within the application, with a little work, but then the problem of WHICH
shared library was responsible (when several are in use) becomes an issue.
Most automatic bug-loggers already do as much, placing the versions of the
shared libraries in use, and the version of the application code, in the 
bug-log header before the user describes the problem.  This gets sent to 
the wizards, who can then solve the halting problem themselves. :-)
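
A bug-log header of the sort mentioned here could be built along these lines.
This is a Python sketch under assumed names: bug_log_header, the field labels,
and the application and library names are all made up for the example.

```python
def bug_log_header(app_name, app_version, libraries):
    """Build the header that precedes the user's description of the problem.

    libraries maps each shared library in use to its revision level.
    """
    lines = [f"Application: {app_name} {app_version}"]
    for name, version in sorted(libraries.items()):
        lines.append(f"Library: {name} {version}")
    lines.append("Description:")
    return "\n".join(lines)


header = bug_log_header(
    "ledger", "3.1",                      # hypothetical application
    {"print": "10.2", "float": "9.4"},    # hypothetical libraries in use
)
print(header)
```

With every library revision in the header, whoever reads the report can at
least see which combination was running, even if the user cannot tell which
library was at fault.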

>I think I understand you to be saying that your approach allows the system
>to run in the presence of a new, untested library?  How does this differ
>(in the light of the sequel) from the "old way" ?

Because it attempts to find the *known* working version first.  Only if 
this is unavailable does it try the new ones.  Eventually, one would assume
that they would be removed from the system.  This supposes multiple versions.
If simple failure is desired when the preferred `known' library is unavailable
that can be an option (perhaps a `response' field in the program that
holds `fail' `search for known version' `try new version' or `notify sysop').
Optimism or pessimism is then decided, application by application, at
compile or install time.  
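
The `response' field could be sketched in Python roughly as follows. The four
values come from the list above; everything else (the Response enum, the
resolve function, the idea of separate installed and archived lists) is an
assumption made for the example.

```python
from enum import Enum


class Response(Enum):
    FAIL = "fail"
    SEARCH_KNOWN = "search for known version"
    TRY_NEW = "try new version"
    NOTIFY_SYSOP = "notify sysop"


def resolve(known_good, installed, archived, response):
    """Return (revision or None, action taken) for one library."""
    # The known working version is always preferred when it is installed.
    if known_good in installed:
        return known_good, "use known version"
    if response is Response.SEARCH_KNOWN and known_good in archived:
        return known_good, "load known version from archive"
    if response is Response.TRY_NEW:
        newer = [rev for rev in installed if rev > known_good]
        if newer:
            return max(newer), "try new version"
    if response is Response.NOTIFY_SYSOP:
        return None, "notify sysop"
    return None, "fail"


# The known version (9, 4) is gone; only (10, 2) is installed.
print(resolve((9, 4), [(10, 2)], [(9, 4)], Response.SEARCH_KNOWN))
print(resolve((9, 4), [(10, 2)], [], Response.TRY_NEW))
print(resolve((9, 4), [(10, 2)], [], Response.FAIL))
```

Setting the field to FAIL gives the pessimistic behaviour and TRY_NEW the
optimistic one, so the choice is made per application, not system-wide.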

>> 	Programs can find problems and log them, notifying the user,
>> 	or users can find problems.  In either case, the `buggy' revision
>> 	will no longer be used by that program.  Or at least, that copy
>> 	of that program.  An alternative would be to have the library store
>> 	the failed-program data, but that would impose a burden.
>
>Exactly how are programs to do this?  Is this not a close relative to
>the halting problem?  I've heard that the 5ESS control program was

The bulk of the detection is done by users.  Only the reporting is automated,
and the prevention of further errors.  I didn't mean to imply that the program
was to find problems with its own operation.  Rather I meant that the OS
facilities for error detections would find problems by running the program
and seeing it fail in an obvious way.

>> 	It is being effectively employed, at least partially, in XNS, and 
>> 	I believe that a similar, though less straightforward, system is
>> 	used in IBM mainframes.
>
>Perhaps I misunderstand you.  Do these operational systems employ human
>intervention in the error detection loop?

Yes.  XNS doesn't have automatic `try-new-library', so far as I know,
though there was talk about it at one point, I think.  Programmers actually
have to modify the Courier programs that use resources over the Ethernet.
I could be wrong about this.  Perhaps there's an XNS wizard out there ?

>> 	Users could become `spoiled' enough to count on the system to find
>> 	incompatibilities, and fail to look for data errors themselves.
>
>Naive users are a problem in many areas, password security being one of the
>most well known, with inadequate failure reporting running a close second.

One way to aid this is to have all `new library tries' send some sort of 
output report to the user's mailbox, telling him to doublecheck the output
just in case some weirdness has occurred.  Inform him that HE IS RESPONSIBLE
and that PROBLEMS CAN OCCUR.  Conscience can usually take care of the rest.
Unless there are a LOT of such updates.

>> 	Shared libraries would have to be checked, on open, for compatibility.
>
>If by this you mean comparing them against the stored list of compatibilities,
>I had understood this to be a part of the overhead of that scheme.  Do you
>have something else in mind?  Perhaps you allude to the "testing" of the
>library on first encounter?

Yes and Yes.  It's all overhead, in any case.  The "testing" overhead exists
in some sense or other, even if it's simply the user eyeballing his output
to make sure that his new floating-point library didn't float into deep space.

>> Considering some of these are problems already extant in the existing
>> bug-spotting procedures, and the worst thing that gets added is a little
>> extra data and a few more cycles to open libraries, it seems pro overall.
>
>Actually, assuming I properly understand you, the user complacency is
>probably the worst that gets added.  Especially if this extends to
>the software engineering folks, who _should_ be testing things and
>not relying on a mathematically unsound (isomorphic with the halting
>problem) problem detection and logging scheme.

That's why user notification is important.  I should have mentioned it before.
The first thing to remember is that MOST PEOPLE DON'T KNOW THAT BUGS MOVE,
and that new versions of buggy code are often just buggy in a different way.
Even if it worked before.  I didn't mean to imply that the machine hides this
data from the user.  Only those parts of the testing process that it performs.

>I think the system you advocate, call it "optimistic but reactionary",

Good name.  But it could also be made pessimistic with the options I mention.
The reactionary component is necessary to minimize that worst of all bugs,
the long-undetected data error.  Once I heard that CNCP Telecommunications
lost $90 million over ten years of using a bad formula to calculate rates.
The error ?  Placing a +1 under, rather than outside, a square root sign.
Where there's one bug, there's usually many.  Let the programmers root `em out.

>is a useful addition to the family of library sharing algorithms.  It
>should not be expected to work miracles, and indeed should be seen as
>a way of integrating _user_ problem reporting into the library upgrade
>cycle rather than eliminating human testing.

Perhaps an appropriate solution is to have applications know which algorithm
they are to use (since we've already stored failure data).  Provide `em all.
OS code uses ONLY what it is told, quick hacks use anything. Caveat programmer.

Nothing can eliminate human testing.  But users, after all, are the only
persons with a vested interest in spotting ALL types of problems. Their
reports are, in a sense, the only meaningful ones.  I think it is lack of
familiarity with bug-reporting procedures, and the necessity of initiating 
the operation themselves, or perhaps recording the data until they can,
that causes things to go unreported.

>> This has been an interesting debate.  Keep it up.
>I concur.

Is the posting to comp.os.misc and comp.unix.wizards appropriate?
I keep wondering if it doesn't belong elsewhere, but I can't think
of anywhere. 
 
>>Steve Nuchia	    | [...] but the machine would probably be allowed no mercy.
>>uunet!nuchat!steve  | In other words then, if a machine is expected to be
>>(713) 334 6720	    | infallible, it cannot be intelligent.  - Alan Turing, 1947

*Craig's Corollary* to the Turing Test:
"Nobody will believe it's intelligent until it lies just to cover its ass."
Quote liberally.  No applause, just throw money.  

Hope this clears up the ambiguities, Steve.  Thanks for the opportunity
to think about these things in a more directed fashion.  I appreciate it.

	Craig Hubley, Unicus Corporation, Toronto, Ont.
	craig@Unicus.COM				(optimistic, Internet)
	{uunet!mnetor, utzoo!utcsri}!unicus!craig	(pessimistic,dumb uucp)
	mnetor!unicus!craig@uunet.uu.net		(pessimistic,dumb arpa)