[comp.lang.modula2] union types

cornwell@NRL-CSS.ARPA (Mark Cornwell) (03/23/88)

I'm trying to build a Modula-2 interface to a C-library and have hit
a snag.  Can anyone construct a suitable translation of:

    struct obj_name {
       int obj_type;
       union {
          int id;
          char *path;
       } obj_id;
    };


All my attempts are quite ugly and require introducing more
identifiers and syntax than I would like.

-- Mark Cornwell

nckurtz@ndsuvax.BITNET (Richard Kurtz) (03/25/88)

Mark, I believe the closest you could come to simulating the C union
statement would be using a Variant record.  I know this involves an
extra identifier, but I don't know any way around this.  I am currently
trying to write a program that translates C procedures to Modula2
procedures and that is how I chose to translate the C union type.
   I am interested in your project as it might relate somewhat to
what I am doing.  Could you send me more information on what you are
doing, or could you send your code?  thanks.  -Richard Kurtz

--
INTERNET: NCKURTZ%NDSUVAX.BITNET@wiscvm.wisc.edu
BITNET:   NCKURTZ@NDSUVAX
UUCP: ...psuvax1!NDSUVAX.BITNET!NCKURTZ   or ...ihnp4!umn-cs!ndsuvax!nckurtz

cornwell@NRL-CSS.ARPA (Mark Cornwell) (03/26/88)

Richard,

Thanks for your interest.  I'm afraid our project won't be of as much interest
to you as it may have appeared.  We will be writing a secure message system
on top of a variant of the UNIX operating system.   The system will be written
mostly in Modula-2.  It is important that the application call the UNIX
interface directly.  Our Modula-2 compiler did not come with an interface to
UNIX, just a set of libraries for I/O, files, etc.

I spent the last few days writing a module of `glue' routines that handle
the difference between the calling conventions of the Modula-2 compiler and
the C-libraries so that I can write stubs in Modula to call the C-functions.
These stubs have to push parameters on the run-time stacks, stash return values
in registers and things like that.  That is what the procedures do.  In addition,
the interface needs a set of type declarations that correspond to the types
one would find in the header files of the UNIX system interface.  It was the
writing of these header files that prompted my question.

In the end I found a pretty solution to the grubby coding part.  I wrote some
m4 macros that take a one line description of the function and its parameters
and generate the linkage code, computing the proper calling sequence from
the sizes of the parameters, the type of the functions, etc.  It's like a
mini version of a code generator as you might find in a compiler.  I'm pretty
proud of it.  It shrinks to two pages of code what would have taken me 40 or
50 to write if I'd taken the brute-force approach.

E.g., it lets me write:

    cgen(chown,INTEGER,path,StringPtr,owner,INTEGER,group,INTEGER)

And the cgen macro will expand to:

   PROCEDURE chown ( path: StringPtr; owner: INTEGER; group: INTEGER ) : INTEGER;
   VAR
      adr:ADDRESS;
   BEGIN
     (* Push one word parameter: group *)
       SETREG(AX,group);       (* ax <- group *)
        CODE(50H);              (* push ax *)

     (* Push one word parameter: owner *)
       SETREG(AX,owner);       (* ax <- owner *)
        CODE(50H);              (* push ax *)

     (* Push two word parameter: path *)
    adr := path;
        SETREG(AX,adr.SEGMENT);
        CODE(50H);              (* push ax *)
        SETREG(AX,adr.OFFSET);
        CODE(50H);              (* push ax *)

     (* Call C-library routine: chown *)
    SetCLibDS;               (* set DS to correct value *)
        EXTCALL("_chown");         (* call "_chown" *)

     (* Pop the 4 words pushed off the stack *)
    CODE(83H,C4H,08H);        (* add sp,8 *)

     (* Move the one word function result from AX to BX *)
    CODE(8BH,D8H);             (* mov bx,ax: BX <- AX *)

   END chown;


The macro has to be clever enough to know the sizes of the arguments and
whether the function returns a value as well as the size of the return
value.  The calling conventions vary with respect to all of these.
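
The bookkeeping described above can be sketched in C.  This is not the m4
macro itself, only an illustration of what it has to compute for the chown
example: the push order (last parameter first) and the number of bytes to pop
after the call.  The names `param`, `cleanup_bytes` and `push_order` are
hypothetical, and a 2-byte word is assumed as in the 8086 expansion shown.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define WORD_BYTES 2   /* assumption: 16-bit words, as in the chown expansion */

struct param {
    const char *name;
    int words;         /* size of the parameter in machine words */
};

/* Bytes the stub must pop after the call (the N in "add sp,N"). */
int cleanup_bytes(const struct param *p, size_t n)
{
    int words = 0;
    for (size_t i = 0; i < n; i++)
        words += p[i].words;
    return words * WORD_BYTES;
}

/* Write the push order, last parameter first, as a comma-separated list.
   Output is truncated at the last name that fits in the buffer. */
void push_order(const struct param *p, size_t n, char *out, size_t outsz)
{
    size_t used = 0;
    if (outsz == 0)
        return;
    out[0] = '\0';
    for (size_t i = n; i-- > 0; ) {
        int w = snprintf(out + used, outsz - used, "%s%s",
                         used ? "," : "", p[i].name);
        if (w < 0 || (size_t)w >= outsz - used) {
            out[used] = '\0';   /* drop the partially written name */
            break;
        }
        used += (size_t)w;
    }
}
```

For chown (path: two words, owner and group: one word each) this gives the
order group, owner, path and a cleanup of 8 bytes, matching the expansion.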

My implementation module contains about a hundred or so lines that are just
calls to cgen.

Unfortunately, all of this is just incidental to the project.  It was
an enjoyable distraction for the last few days and is pretty much finished
now.

If you are still interested, I can send you the code.

--Mark

R_Tim_Coslet@cup.portal.com (03/26/88)

>I'm trying to build a Modula-2 interface to a C-library and have hit
>a snag.  Can anyone construct a suitable translation of:
>
>    struct obj_name {
>       int obj_type;
>       union {
>          int id;
>          char *path;
>       } obj_id;
>    };

The direct equivalent of the above C in Modula-2 is....

	ObjName : RECORD
		    ObjType : INTEGER;
		    CASE : BOOLEAN OF
		      TRUE : id : INTEGER |
		      FALSE : path : POINTER TO CHAR |
		    END
		  END


This should create the same data structure (and actually has one less
identifier: obj_id). The key is that the TAG identifier is optional
in the Variant CASE statement (a lot of people never notice this!!!).

I have not yet done any Modula-2 programming but I have been using
Pascal for over eight years (the above is also applicable in Pascal).

I verified this against Niklaus Wirth's book:

	Programming in Modula-2		Third, Corrected Edition

Check the syntax diagrams in Appendix 4 (starting on page 189)
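
For reference, here is a small C sketch of the layout that any Modula-2
translation has to reproduce: both union members start at the same offset,
after obj_type.  The helper functions `id_offset` and `path_offset` are
hypothetical names added for illustration; whether a given Modula-2 compiler
lays out its variant record identically has to be checked per compiler.

```c
#include <assert.h>
#include <stddef.h>

struct obj_name {
    int obj_type;
    union {
        int id;
        char *path;
    } obj_id;
};

/* Offset of each variant field from the start of the struct. */
size_t id_offset(void)
{
    return offsetof(struct obj_name, obj_id.id);
}

size_t path_offset(void)
{
    return offsetof(struct obj_name, obj_id.path);
}
```

The two offsets are equal by definition of a union; the exact value depends
on the target's alignment rules.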

schaub@sugar.UUCP (Markus Schaub) (03/28/88)

>     struct obj_name {		|	objName=RECORD
>        int obj_type;		|	  objType: INTEGER;
>        union {		|	  objId: RECORD
>           int id;		|	    CASE (* objType *):INTEGER OF
>           char *path;		|	    | 0:  id: INTEGER;
>        } obj_id;		|	    | 1:  path: POINTER TO CHAR
>     };			|	    END
>				|	  END
>				|	END
>				| 
> -- Mark Cornwell		| -- Markus Schaub

If obj_id is not used, you can simplify the RECORD to a single case-record.

-- 
     //	Markus Schaub			| The Modula-2 People:
    //	M2Amiga Developer		| Interface Technologies Corp.
\\ //   uunet!nuchat!sugar!schaub	| 3336 Richmond Ave. Suite 323
 \X/    (713) 523-8422			| Houston, TX 77098

paul@vixie.UUCP (Paul Vixie Esq) (03/28/88)

In article <4118@cup.portal.com> R_Tim_Coslet@cup.portal.com writes:
##Can anyone construct a suitable translation of:
##
##    struct obj_name {
##       int obj_type;
##       union {
##          int id;
##          char *path;
##       } obj_id;
##    };
#
#The direct equivalent of the above C in Modula-2 is....
#
#	ObjName : RECORD
#		    ObjType : INTEGER;
#		    CASE : BOOLEAN OF
#		      TRUE : id : INTEGER |
#		      FALSE : path : POINTER TO CHAR |
#		    END
#		  END
#
#
#This should create the same data structure (and actually has one less
#identifier: obj_id).

Sometimes you *want* that intervening obj_id.  In C, it's harder (though
possible) to make a variant record where this intervening member needn't
be named in references to the variant fields; in M2, you can do it thus:

TYPE	ObjName = RECORD					(* note 1 *)
		ObjType: INTEGER;
		ObjId: RECORD
			CASE BOOLEAN OF				(* note 2 *)
				TRUE:   id: INTEGER|
				FALSE:  path: POINTER TO CHAR;	(* note 3 *)
			END
		END
	END;

Note 1: we are creating a type in the C example, not a variable.
Note 2: No ':' before the type as far as I know; [brackets] may be needed
	(I don't recall), and the type could be enumerated if more than
	two variants are needed -- BOOLEAN is convenient but not mandatory.
Note 3: POINTER TO CHAR is one way to represent strings, but sometimes arrays
	are used.  Sure would be great if open arrays were allowed in places
	other than a formal argument on a procedure...
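
As an illustration of the "no intervening member" form on the C side: C11
(which postdates this thread) added anonymous unions, so a compiler that
supports them accepts the sketch below.  The struct name `obj_name_flat` and
the helper `make_and_read_id` are hypothetical.

```c
#include <assert.h>

struct obj_name_flat {
    int obj_type;
    union {              /* anonymous union: no obj_id member name */
        int id;
        char *path;
    };
};

/* Store an id and read it back without naming an intervening member. */
int make_and_read_id(int value)
{
    struct obj_name_flat o;
    o.obj_type = 0;
    o.id = value;        /* o.id, not o.obj_id.id */
    return o.id;
}
```

With older compilers the same effect is usually approximated with a named
union member plus a preprocessor macro for each field.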
-- 
Paul A Vixie Esq
paul%vixie@uunet.uu.net
{uunet,ptsfa,hoptoad}!vixie!paul
San Francisco, (415) 647-7023

alan@pdn.UUCP (Alan Lovejoy) (03/28/88)

In article <8803242203.AA28027@ndsuvax.UUCP> Info-Modula2 Distribution List <INFO-M2%UCF1VM.bitnet@jade.berkeley.edu> writes:
>Mark, I believe the closest you could come to simulating the C union
>statement would be using a Variant record.  I know this involves an
>extra identifier, but I don't know any way around this.  I am currently
>trying to write a program that translates C procedures to Modula2
>procedures and that is how I chose to translate the C union type.

Why does a variant record require an extra identifier?  The only extra
identifier I know of is OPTIONAL:

   TYPE

     TaggedVariant = 
       RECORD
	 CASE type: CARDINAL OF
	   0: foo: Foo;
	   |
	   1: bar: Bar;
         END;
       END;

     UntaggedVariant =
       RECORD
	 CASE CARDINAL OF
	   0: foo: Foo;
	   |
	   1: bar: Bar;
         END;
       END;

Variant records do not have to have a tag field.  What other extra
identifier could there be?  Did you mean the identifier that specifies
the type of the case labels (CARDINAL in the examples above)?  If so,
that identifier is a compile-time creature only--it has no existence 
at run time.
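
A rough C analogue of the two records, assuming stand-in definitions for Foo
and Bar (the thread does not define them): the tag is a real field that costs
storage, while the case labels of the untagged form leave no trace in the
data.  `tag_overhead` is a hypothetical helper.

```c
#include <stddef.h>

typedef int    Foo;   /* stand-ins: Foo and Bar are not defined in the thread */
typedef double Bar;

struct tagged_variant {       /* CASE type: CARDINAL OF ... */
    unsigned type;            /* the tag is a field with its own storage */
    union { Foo foo; Bar bar; } u;
};

union untagged_variant {      /* CASE CARDINAL OF ... */
    Foo foo;                  /* labels 0 and 1 exist only at compile time */
    Bar bar;
};

/* Extra bytes the tagged form occupies, including any padding. */
size_t tag_overhead(void)
{
    return sizeof(struct tagged_variant) - sizeof(union untagged_variant);
}
```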

--alan@pdn

alan@pdn.UUCP (Alan Lovejoy) (03/31/88)

In article <850@vixie.UUCP> paul@vixie.UUCP (Paul Vixie Esq) writes:
>Sometimes you *want* that intervening obj_id.  In C, it's harder (though
>possible) to make a variant record where this intervening member needn't
>be named in references to the variant fields; in M2, you can do it thus:
>
>TYPE	ObjName = RECORD					(* note 1 *)
>		ObjType: INTEGER;
>		ObjId: RECORD
>			CASE BOOLEAN OF				(* note 2 *)
>				TRUE:   id: INTEGER|
>				FALSE:  path: POINTER TO CHAR;	(* note 3 *)
>			END
>		END
>	END;
>
>Note 1: we are creating a type in the C example, not a variable.

Who said otherwise?  The Modula-2 examples I have seen in this
discussion were all type definitions, weren't they?

>Note 2: No ':' before the type as far as I know; [brackets] may be needed
>	(I don't recall), and the type could be enumerated if more than
>	two variants are needed -- BOOLEAN is convenient but not mandatory.

You are both wrong and right:  the original syntax for Modula-2 did not
have a colon before the type of a tagless variant.  Most compilers
still support this syntax (usually as the only option).  However, Wirth
changed the syntax in the third edition of his book (PIM2e3) making the
colon required.  

>Note 3: POINTER TO CHAR is one way to represent strings, but sometimes arrays
>	are used.  Sure would be great if open arrays were allowed in places
>	other than a formal argument on a procedure...

POINTER TO CHAR is a TERRIBLE way to represent strings (unless you hide
this representation behind an opaque type).  Why?

1) There is no guarantee that SIZE(aCharVariable) = SIZE(string[0])
(assuming the declarations: 
   VAR aCharVariable: CHAR; string: ARRAY [0..n] OF CHAR).

This is not just theoretical.  My 68k M2 compiler uses two bytes
for a character variable but one byte for each character in a 
string.  This breaks the following code:

  VAR

    cp, end: POINTER TO CHAR;
    string: ARRAY [0..n] OF CHAR;

  ...

  cp := ADR(string);
  end := ADDRESS(cp) + String.Length(string);
  WHILE ADDRESS(cp) < ADDRESS(end) DO
    Process(cp^);
    cp := ADDRESS(cp) + TSIZE(CHAR);
  END;

Even if we replace TSIZE(CHAR) with Char.lengthInAString, we still run
up against the problem that the compiler thinks cp^ is a reference to
two bytes, not one.  So it emits object code such as MOVE.W, ADD.W,
CMP.W, etc, when it should be emitting MOVE.B, ADD.B, CMP.B, etc. 
Whether this results in erroneous behaviour depends on the byte sex
of the CPU (and the byte sex assumed in the algorithm).

On the 68k, this is even more serious BECAUSE WORD MEMORY ACCESSES MUST
OCCUR ONLY FOR EVEN ADDRESSES.  An odd effective address used with WORD
or LONGWORD data results in a processor-generated ADDRESS ERROR.

POINTER TO CHAR is not a portable way to represent strings.
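
The width problem can be shown in C.  This is a sketch, assuming ASCII and an
8-bit byte: `read_as_byte` is what the algorithm intends, `read_as_word`
mimics what a compiler with a two-byte CHAR does for cp^.  Both names are
hypothetical, and memcpy stands in for the word load so that the example
itself does not perform a misaligned access.

```c
#include <stdint.h>
#include <string.h>

/* Intended behaviour: one byte per character. */
unsigned read_as_byte(const char *p)
{
    return (unsigned char)p[0];
}

/* Behaviour with a 2-byte CHAR: a 16-bit load that also picks up the
   following character.  Which half ends up where depends on byte order. */
unsigned read_as_word(const char *p)
{
    uint16_t w;
    memcpy(&w, p, sizeof w);
    return w;
}
```

For the string "AB" the word read yields 4142H on a big-endian machine such
as the 68k and 4241H on a little-endian one; neither equals the character 'A'.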

2) When the programmer sees 'string: POINTER TO CHAR', there is vital
information about this object which is completely missing:

  a) How big is the string?
  b) Has 'string' been properly initialized to point either to NIL
     or to some string?
  c) Does 'string' point to an object on the heap (memory for the
     string was allocated using NEW or ALLOCATE), or does it point
     to an object on the stack (string := ADR(aStackVariable))?
     You wouldn't want to call DISPOSE or DEALLOCATE on 'string'
     if it points to a stack variable.
  d) How many other pointer variables reference the same object?
     You don't want to DEALLOCATE 'string' if there are still
     active references to it.

POINTER TO CHAR is not a safe way to represent strings.

3) Programmers normally expect to be able to reference the i'th
character in a string using array-index syntax:  string[i].
If string is POINTER TO CHAR, that's not possible. Better is
'VAR string: POINTER TO ARRAY [0..Char.maxArray] OF CHAR;'.
'Char' is a definition module containing useful system dependent
parameters describing the properties of characters and arrays of
characters.  Char.maxArray is the highest zero-based index that
the compiler will allow for an ARRAY OF CHAR.  This permits
access to the i'th element using traditional syntax: string^[i],
yet still provides for pointer arithmetic and dynamic sizing.
It also finesses the SIZE(CHAR) problem.

Even better is:

  TYPE

    DynamicStringIndex = [0..Char.maxArray];

    DynamicString = 
      RECORD
        size: DynamicStringIndex;
        base: POINTER TO ARRAY DynamicStringIndex OF CHAR;
      END;

Best is:

  DEFINITION MODULE DynamicString;

    EXPORT QUALIFIED
      STRING, Index, ...;  (* PRIVATE is NOT exported *)  

    TYPE

      Index = [0..Char.maxArray];
      PRIVATE;
      STRING =
	RECORD
	  size: Index;  (* read only variable *)
	  base: PRIVATE;
        END;
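
The same idea carried over to C, as a sketch only: callers see an incomplete
type, and the size and base fields are reachable only through functions.  All
names here (`dyn_string`, `ds_new`, `ds_size`, `ds_at`, `ds_free`) are
hypothetical; in a real library the struct definition would live in the
implementation file rather than next to the typedef.

```c
#include <stdlib.h>
#include <string.h>

/* Interface part: callers see only an incomplete type. */
typedef struct dyn_string dyn_string;

/* Implementation part: the layout is private. */
struct dyn_string {
    size_t size;   /* number of characters, excluding the terminator */
    char  *base;   /* heap storage owned by this object */
};

/* Allocate a string holding a copy of text; returns NULL on failure. */
dyn_string *ds_new(const char *text)
{
    dyn_string *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->size = strlen(text);
    s->base = malloc(s->size + 1);
    if (s->base == NULL) {
        free(s);
        return NULL;
    }
    memcpy(s->base, text, s->size + 1);
    return s;
}

/* Read-only access to the size. */
size_t ds_size(const dyn_string *s)
{
    return s->size;
}

/* Bounds-checked read; returns '\0' for an out-of-range index. */
char ds_at(const dyn_string *s, size_t i)
{
    return i < s->size ? s->base[i] : '\0';
}

void ds_free(dyn_string *s)
{
    if (s != NULL) {
        free(s->base);
        free(s);
    }
}
```

Because the object always owns heap storage and knows its size, questions
a) and c) above have fixed answers for every value of the type.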

4) "Open arrays" that are not procedure parameters are possible but 
do not come cheaply.  Assume the following declarations:  

  VAR
    string10: ARRAY [0..9] OF CHAR;
    string80: ARRAY [0..79] OF CHAR;
    foo: Bar;
    dynamicString: ARRAY OF CHAR;
    i: CARDINAL;

When the block in which these declarations reside is entered, the
statically sized objects (everything but 'dynamicString') can easily
be allocated on the stack.  But the size of 'dynamicString' is
undefined, so it cannot be allocated.  What can be allocated is
a hidden variable which will point to 'dynamicString', and a hidden
variable which will specify the size of 'dynamicString'.  Somewhere
in the block, a value may be assigned to dynamicString:

  dynamicString := string10;

It would be nice if we could allocate the memory for dynamicString
on the stack at this point.  If the usage of dynamicString is as
simple as this case is so far, we can.   The problem is how to 
allocate memory on the stack for multiple open arrays whose size
changes more than once during execution of the block (open array
procedure parameters don't have this problem because their size
is known at block entry and cannot change until block exit). 
When the size of an open array changes, the value returned by
ADR(anOpenArray) probably will have to change as well.  Algorithms
that are valid for static arrays will likely break if the static arrays
are redefined to be dynamic open arrays.

There is no general solution to this problem except to allocate
memory on the heap and not the stack.  So the only thing generic open
arrays give us is the ability to write 'anOpenArray[index]' instead of 
writing 'aDynamicArrayAllocatedByTheProgrammer^[index]'.  We could get 
the same effect by slightly changing the syntax of the language so that
'a[i]' is recognized as shorthand for 'a^[i]'.  Oh yeah, the compiler
automatically allocates and deallocates for us.  Which completely
hides from the programmer the fact that these arrays are heap objects.
Which has both its good and bad points.
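
What such a compiler would have to generate behind the scenes can be sketched
in C.  This is an assumed model, not any actual compiler's scheme:
`open_array` holds the two hidden variables, and `oa_assign` plays the role
of 'dynamicString := src'.  All names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

struct open_array {
    size_t len;    /* hidden size variable */
    char  *data;   /* hidden pointer variable; NULL until first assignment */
};

/* Models "dynamicString := src": resize the heap block, then copy.
   Returns 0 on success, -1 if memory could not be obtained. */
int oa_assign(struct open_array *a, const char *src, size_t n)
{
    char *p = realloc(a->data, n ? n : 1);
    if (p == NULL)
        return -1;             /* a->data is still valid on failure */
    memcpy(p, src, n);
    a->data = p;               /* may differ from the old address */
    a->len = n;
    return 0;
}

/* Models the implicit deallocation at block exit. */
void oa_release(struct open_array *a)
{
    free(a->data);
    a->data = NULL;
    a->len = 0;
}
```

Any address taken from a->data before a second oa_assign may be stale
afterwards, which is the ADR(anOpenArray) problem described above.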

It's simpler (for the compiler writer) not to open this can of worms.
If you feel you really need this functionality, I suggest you try
Smalltalk, LISP or APL.

Personally, I'd like to see new syntax permitting variables to 
have their initialization and termination processing defined 
as part of their declaration.  Example:

VAR
  i: CARDINAL := 0;  (* initialize i to zero *)
  a: POINTER TO ARRAY [0..n] OF CHAR 
   := NEW('Hello, world.') (* initialize a to NEW('Hello, world.');
			      NEW should be a function which accepts
			      the initial value of the allocated
			      object as its optional argument *)
   := DISPOSE(a); (* on termination of the block, assign DISPOSE(a) to a;
		     DISPOSE should also be a function *)
  x: REAL
   := 3.14159  (* initialize x to pi *)
   := circumference / (2.0 * radius);  (* on block exit, set x to be
					  the value of this expression *)
  circumference: REAL := 0.0;
  radius: REAL := 1.0;

The block termination code would execute just before the expression
following a RETURN statement is evaluated, or else just before executing
a RETURN (if the block is not a function).  Notice that this can help
to guarantee that functions don't return dangling pointers.

Another suggestion would be to change the semantics of pointer syntax
so that a reference to a pointer variable references its dynamic object
instead of the address of its dynamic object:

  VAR

    p: POINTER TO FooBar;
    a: ADDRESS;
....

  p := aFooBar;   (* old syntax: p^ := aFooBar *)
  a^ := ADR(p);   (* old syntax: a := p *)
  a^ := p^;       (* old syntax: a := p *)

This makes it possible to abstract over an algorithm so that it is
valid either for pointers or non-pointers.  It's analogous to VAR
and VALUE parameters for procedures which make it possible to abstract
procedure calls with respect to arguments being passed as addresses
or as values.

--Alan@pdn