[net.lang.mod2] large SETs desired

broman@noscvax.UUCP (Vincent P. Broman) (10/03/85)

<> 

Both Pascal and Modula-2 offer simple and useful notations for SETs, with
builtin operators to manipulate them, making a powerful mathematical 
abstraction available in a high-level form.  I use sets often.  Unfortunately, 
the warning that SETs may be restricted by the implementor to base types with 
few (8, 16?) elements, forces me often to replace SETs by ARRAYs of SETs and 
write procedures to manipulate them, in order to write portable code.  This is 
clumsy and detracts much from their notational convenience.  

I would suggest that rule 2 on page 65 of PIM2 end approximately as follows:

       "2. ... set a limit to the number of elements admissible in base types.  
       This should be no more restrictive than memory or segmentation 
       considerations necessitate. In any case, base types of up to N (256?) 
       elements shall be supported." 

Supporting larger SETs in the compiler would probably require that some SET 
operations be realized as single machine instructions, others as subroutine 
calls. This is a reasonable price to pay to make SET types machine 
independent.  

Wirth slipped a bit when he said on the page (65) mentioned above, that "set 
types are a powerful tool ... to express operations on individual bits of an 
operand ...." Actually, the BITSET is the low-level tool used to manipulate 
bits on your machine.  The user-defined SET types implement the high-level 
concept of a mathematical set, and if they are mapped into bit operations by 
the compiler, that fact has little relevance to how I should use them.  
Even on a decimal machine BITSET may disappear, but I would expect user-defined
SETs to work.

Hurray that Modula-2 has SETs.  Among the pile of features pasted onto Ada, 
this important one seems to have been left out.

Vincent Broman   (619) 225-2365      MILNET: broman@nosc
Analysis Branch, code 632            UUCP: {ihnp4,decvax,akgua,dcdwest,
Naval Ocean Systems Center               allegra,ucbvax}!sdcsvax!noscvax!broman
San Diego, CA  92152                 ICBM: 32deg 42min N / 117deg 14min W

quiroz@rochester.UUCP (Cesar Quiroz) (10/07/85)

In the posting of the reference, the problem of supporting reasonably 
large sets is considered.  A comment is also made as to BITSETs being
a good starting point for that.  I want to add a few of my own opinions,
in the hope of public scrutiny. 

Sets are a Good Thing To Have only if you can manage to use them without
undue hassle.  Pascal first and the Modula2 provide powerful *paper* set 
types, that become much less useful thanks to not being supported beyond
the width of a machine's word (just a few elements).  This, of course,
doesn't prevent competent implementors from doing things in a more usable
way, but this still makes the portability-minded user beware.  And so those
enlightened users end up writing routines that amount exactly TO THEIR OWN
IMPLEMENTATION OF SETS!

Now there is a reason for this (so I think).  Doing Sets right in all their
glory is hard.  In some broad sense, it is similar from doing strings right.
Representations that are good for one application slow down another and 
the decision is hard (or impossible) to make statically (at compile time).
So it is almost sure that a good number of applications will end up requiring
a different set implementation.

Why this happens may be left open for discussion.  At any rate, my preference
here would be to left Sets (and Strings, for that matter) out of the language,
but support them with adequate primitives (usable for more than implementing
sets) and standard libraries (you can find a few implementation options that
cover most of the applications).

This connects to the last point, the support for bit strings.  Saying that
*sets* are a good mechanism to handle *bits* is going backwards atrociously.
When you need to think of *bits*, the least you want is something else to
get in your way.  I barely remember that I used to be very pleased with the
way bit-strings were implemented in AlgolW and still think that that is the
way to go.  Provide the ability to refer to arrays of bits (and guarantee
only that they are densely packed in an implementation dependent way, so they
may be different from array of [0..1]) and provide a reasonable way to
denote constants (i.e., "masks") and you get an abstraction that can be 
conveniently mapped to something sensible in most architectures and that 
allows you to write at least one implementation of sets-of-reasonable-length
as a standard library.

Cesar
-- 
Cesar Augusto  Quiroz Gonzalez

Department of Computer Science     {allegra|seismo}!rochester!quiroz
University of Rochester            or
Rochester,  NY 14627               quiroz@ROCHESTER

bart@reed.UUCP (Bart Massey) (10/08/85)

> In the posting of the reference, the problem of supporting reasonably 
> large sets is considered.

The other referenced poster says that he thinks set ops should be
extrinsic to the language, and so should strings.


I don't think I buy it for M2, for the simple reason of code legibility.
For sets as a mathematical construct, like numeric expressions (another
place where "a lot of tradeoffs are made"), OPERATORS, and not functions,
are desirable as the atomic elements for manipulation of the object.  One
of the reasons it isn't very much fun to write mathematical functions in
LISP, but is in APL, is that an APL function is symbolically much like
what you would write on paper, whereas LISP functions bear no obvious
symbolic relation to the equation they represent.  And M2 of course does
not allow you to define operators.  In Algol68 it might make
sense to make sets extrinsic.

But note also that fixed size sets have a straightforward and
highly efficient internal representation on most machines.  It seems
reasonable to make the compiler generate code to deal with this representation,
instead of using more inefficient user code to do it, since set ops are
often so useful.  I think that this is also a good argument for adding
intrinsic ops to the language to implement strings, as well as stacks,
queues, and other commonly used data structures.  Of course, the tradeoff
is, as always, compiler complexity...

As per the size of the sets, it seems reasonable to insist that compilers
support sets of (bits/byte)*(largest value that fits in a WORD) elements.
This is enough to support any reasonable use, and is still efficient 
to index...

Wouldn't it be nice to be able to write the Sieve of Eratosthenes just
like you think it?

				Bart Massey
				...tektronix!reed!bart

throopw@rtp47.UUCP (Wayne Throop) (10/09/85)

> In the posting of the reference, the problem of supporting reasonably 
> large sets is considered.  A comment is also made as to BITSETs being
> a good starting point for that.  I want to add a few of my own opinions,
> in the hope of public scrutiny. 

Ditto.

> Sets are a Good Thing To Have only if you can manage to use them without
> undue hassle.

Agreed.  However, I think that you are a little pessimistic about what
constitutes "undue hassle" (for the compiler writer).

> Pascal first and the Modula2 provide powerful *paper* set
> types, that become much less useful thanks to not being supported beyond
> the width of a machine's word (just a few elements).

Your tack seems to be that supporting sets beyond a machine word is hard
for the compiler, and that sets should therefore be dropped.  I
disagree.  I don't find implementing multi-word sets to be any more
difficult than implementing, say, complex arithmetic.  FORTRAN has had
that for years, and nobody thinks of it as unreasonably hard to do.

> Doing Sets right in all their
> glory is hard.  In some broad sense, it is similar from doing strings right.
> Representations that are good for one application slow down another and 
> the decision is hard (or impossible) to make statically (at compile time).
> So it is almost sure that a good number of applications will end up requiring
> a different set implementation.

This is true.  But consider substituting "floating point numbers"
for "sets" in the above.  After all, most languages implement a single
format of floating point number (sometimes with a choice of mantissa
length).  Sometimes this just won't serve the needs of developers,
because they need an extended exponent range, or they need different
error-propogation behavior, or they need 16-bit floats for space
economy, or whatever.  I don't take this to mean that languages
shouldn't support floating point numbers.  Nor do I think the above
paragraph shows that sets should not be included in "primitive" language
features.

> Saying that
> *sets* are a good mechanism to handle *bits* is going backwards atrociously.

Also true.  However, I think that languages should support *both* sets
and bit strings as primitive notions.  Just because A can be implemented
using B is no reason for a language not to support A.  Some examples are
that characters can be implemented using integers and complex numbers
can be implemented using floats.  In neither case do I take this to be
an argument that characters or complex numbers should not be supported.

Similarly, I think that supporting arbitrary-length sets is a reasonable
thing for a language to do, and further that Modula "ought to have" done
it.

> Cesar Augusto  Quiroz Gonzalez   {allegra|seismo}!rochester!quiroz
-- 
Wayne Throop at Data General, RTP, NC
<the-known-world>!mcnc!rti-sel!rtp47!throopw

paul@greipa.UUCP (Paul A. Vixie) (10/10/85)

In article <1982@reed.UUCP> bart@reed.UUCP (Bart Massey) writes:
>As per the size of the sets, it seems reasonable to insist that compilers
>support sets of (bits/byte)*(largest value that fits in a WORD) elements.
>This is enough to support any reasonable use, and is still efficient 
>to index...

Let's see... for a VAX, that's

	(8) * (2**32)

This may seem efficient to index, but to me that's still alot of bits.
Even so, I agree with this.

A feature I used heavily when writing non-portable Pascal was constant
SET OF CHAR expressions out in the code... thus

	if (ch = 'e') or (ch = 'q') or (ch = 'x') then ...

was often written

	if (ch in ['e', 'q', 'x']) then ...

. This was more efficient AND more legible, a rare combination.  Wirth
decided to limit sets to having no more elements than a WORD has bits
because everybody implements sets with bit strings (anybody besides me
remember when we used to multiply primes together?), and it's easier
to check the status of a bit if you know in advance what byte it's in.

But for the additional readability, I'd prefer a compiler smart enough
to generate the extra code for manipulating sets larger than a WORD can
handle.  It won't be efficient, but sets that size would not often be
used by system programmers, so the "usual" case (small sets) loses nothing.

Even system programmers need large bit strings once in a while.  Cached
allocation maps for mass storage devices, for example.  Will high-level
Modula code for selecting an element out of an ARRAY OF BITSET be as
efficient as compiler-generated code for SET OF [0..999999] ?  I think not.
-- 

	Paul Vixie
	{decwrl dual pyramid}!greipa!paul

quiroz@rochester.UUCP (Cesar Quiroz) (10/11/85)

From article <1982@reed.UUCP> (bart@reed.UUCP (Bart Massey)):
> ...
>The other referenced poster says that he thinks set ops should be
>extrinsic to the language, and so should strings.
>
>I don't think I buy it for M2, for the simple reason of code legibility.
>For sets as a mathematical construct, like numeric expressions (another
>place where "a lot of tradeoffs are made"), OPERATORS, and not functions,
>are desirable as the atomic elements for manipulation of the object.  One
>of the reasons it isn't very much fun to write mathematical functions in
>LISP, but is in APL, is that an APL function is symbolically much like
>what you would write on paper, whereas LISP functions bear no obvious
>symbolic relation to the equation they represent.  And M2 of course does
>not allow you to define operators.  In Algol68 it might make
>sense to make sets extrinsic.
> ...
Absolutely.  I think there is agreement that most design decisions have to
take a side and leave something out in order to make the rest practical to
implement.  If one needs a language in order to deal with a particular domain,
it is best when that language presents the right level of abstraction (i.e.,
compiler-level support for the necessary entities as first-class objects).  So
you have numerically oriented languages that give you lots of power in terms of
accuracy control and plenty of pertinent types (say, complex numbers of various
precisions) and so on.  Similarly, if the language must deal mainly with text,
it is almost imperative that strings are fully supported (say, that you be able
to express the replacement of a substring with another of different length, with
all the surgery being hidden).

This agreed (so I hope), I must say I failed in my previous posting by not
making explicit the design domain I thought corresponded to M2.  My personal
impression comes from the first line of the Report: "Modula-2 grew out of a 
practical need for a general, efficiently implementable systems programming
language for minicomputers."  I think that those early motivations are better
served leaving sets out of the language in favor, say, of better handling of
bit strings.  (My reasons for believing this are documented in too much a
lengthy fashion in my previous article.) This doesn't prevent M2 from growing 
into a sound language for other types of applications, where first-order 
support for strings and sets would have to be deemed more pertinent.  Currently,
ICON and SETL are good exponents of how to deal with those domains.

The argument in favor of guaranteeing sets of at least N elements, for N 
a big number, is certainly a good compromise.  N<=32 is a total loser.  I'd
like to notice, also, that Bitsets in M2 are closer to a practical 
implementation of bit strings than Sets are to their intent.  Having to think
of "membership" when I mean "being on" is just the class of misuse that I
dislike in this aspect of the language.

-- 
Cesar Augusto  Quiroz Gonzalez

Department of Computer Science     {allegra|seismo}!rochester!quiroz
University of Rochester            or
Rochester,  NY 14627               quiroz@ROCHESTER

rcd@nbires.UUCP (Dick Dunn) (10/15/85)

> > Pascal first and the Modula2 provide powerful *paper* set
> > types, that become much less useful thanks to not being supported beyond
> > the width of a machine's word (just a few elements).
> 
> Your tack seems to be that supporting sets beyond a machine word is hard
> for the compiler, and that sets should therefore be dropped.  I
> disagree...

I'll also go along with the position that multi-word sets are not that hard
to implement.  In a project using a language based on Pascal some years
ago, we implemented LARGE sets--the upper limit on set size was (for some
arcane reasons) 100 000 000 elements.  This is more nearly a storage
limitation than a limit on set size, and certainly answers most desires for
large sets!  It did, however, introduce two restrictions which were not
present in Pascal's set mechanism:

	- First, the basetype of a set was required to be evident from its
	  left context.  This was due in part to the simplistic analysis in
	  the compiler we were using.  It turns out to be a very odd re-
	  striction to explain but one which causes no trouble in practice.
	  An example of a statement violating the constraint is
		if [1,3,5]=S1 then ...
	  (which can be fixed by exchanging the operands of =)
	  The constraint is always met in assignments and procedure calls.
	  We could have eliminated it.
	
	- Second, set types were considered distinct if their basetypes
	  were distinct.  This avoids the problems associated with
	  performing set operations on sets with, say, disjoint basetypes
	  which are nonetheless subranges of the same parent type.  This is
	  an area in which Pascal is surprisingly (needlessly?) lenient,
	  and may be one of the reasons that large sets are not allowed--
	  all sets would have to be maximum size, where in our scheme sets
	  were stored in the smallest natural storage unit that would
	  contain the required number of bits.
-- 
Dick Dunn	{hao,ucbvax,allegra}!nbires!rcd		(303)444-5710 x3086
   ...Simpler is better.

broman@noscvax.UUCP (Vincent P. Broman) (10/19/85)

> I'll also go along with the position that multi-word sets are not that hard
> to implement.  In a project using a language based on Pascal some years...
> 	- First, the basetype of a set was required to be evident from its
> 	  left context.  This was due in part to the simplistic analysis in
> 	
> 	- Second, set types were considered distinct if their basetypes
> 	  were distinct.  This avoids the problems associated with
> 	  performing set operations on sets with, say, disjoint basetypes
> 	  which are nonetheless subranges of the same parent type.  This is
> -- 
> Dick Dunn	{hao,ucbvax,allegra}!nbires!rcd		(303)444-5710 x3086
>    ...Simpler is better.


I suspect that the difficulties in Doing Strings Right, lie mainly in their
having dynamically varying lengths, and in the lack of typename for string
literals and expressions. Modula2 allows large SETs to be Done Right, in all
their glory, because SET literals and expressions have types and basetypes
easily known at compile time, as a result of Wirth's choice of syntax:
"settypename{elt,elt,...}" for any thing but a BITSET, ... a clear winner.

BTW, the intended domain of application for Modula2 seems to me to embrace
general-purpose programming of single-processor computers, including
systems work, numerics, CAD, etc. I think so because Lilith was designed to
run on M2 alone, no f77, C, or anything else as auxilliary.
A good implementation of Modula2 would give Ada a good run for its money
in generality of application, save the area of real-time programming only.

Vincent Broman   (619) 225-2365      MILNET: broman@nosc
Analysis Branch, code 632            UUCP: {ihnp4,decvax,akgua,dcdwest,
Naval Ocean Systems Center               allegra,ucbvax}!sdcsvax!noscvax!broman
San Diego, CA  92152                 ICBM: 32deg 42min N / 117deg 14min W