[net.math.stat] Interface from S to GLIM

bates@uwstat.UUCP (02/06/85)
The initial distributions of the S System included a function to use the
`GLIM' package to fit generalized linear models. For those of you not
familiar with GLIM, the generalized linear models are different from
general linear models as implemented in, say, SAS. See the documentation
below. One disadvantage of this interface was that it required a
modified version of the GLIM program. What follows is a GLIM interface
that will work with the standard version of GLIM as distributed by the
Numerical Algorithms Group (NAG) in Britian.

I apologize for sending almost 600 lines of code in this news group but
I thought that this code was so specialized (it is meaningless to people
who do not have the S System) that it did not belong in net.sources. If
it becomes a more widespread practice to post sources for S functions to
the net (something I would encourage) we might consider forming a
net.math.stat.sources newsgroup.


Here is a shell archive for the interface to glim.  The line containing
"excmd" in file glim.i is correct if the executable version of glim
resides in $SHOME/cmd/glim.  If it is somewhere else, the line should
be changed to include an appropriate path name for the glim command.

------- cut here -------------------

# To unbundle, sh this file
cat >Smakefile <<'//GO.SYSIN DD *'
all: glim glim2

glim: inter1.o glim.x glstdr.o
	f77 $(LDFLAGS) $(STRIP) -o x/glim inter1.o glim.x glstdr.o $(LIBR)
	@touch glim
	@echo glim loaded
inter1.C: ; echo 'INCLUDE(u/cinter)CINTER(glim)' >inter1.C

glim2: inter2.o glim2.x
	f77 $(LDFLAGS) $(STRIP) -o x/glim2 inter2.o glim2.x  $(LIBR)
	@touch glim2
	@echo glim2 loaded
inter2.C: ; echo 'INCLUDE(u/cinter)CINTER(glim2)' >inter2.C
//GO.SYSIN DD *
cat >glim <<'//GO.SYSIN DD *'
.BG
.FN glim
.TL
glim: Interface to the GLIM system
.CS
glim(var1, var2, ..., y=, error=, link=, offset=, fit=, weight=,
     den=, pwr=, names= )
.PP
The `glim' function provides the interface to the GLIM system, see
below, for analysis of generalized linear models.
.AG vari
the variates and factors (in GLIM terminology) to be used in the fit.
These should all be vectors (matrices) of the same length
(number of rows). If no `y' argument is given, the last of
the variates is taken to be the response. The name `vi' is
given to the ith variate; columns of matrices count as
variates.
Categorical variables will be passed to GLIM as factors,
anything else is passed as a variate.
.AG y=
optionally, the data for the dependent variable (y). 
.AG error=
optional lower-case character value for the `$error' directive for GLIM.
.AG link=
optional lower-case character value for the `$link' directive for GLIM.
.AG offset=
optional value for the `$off' directive for GLIM.
.AG fit=
optional lower-case character value for the `$fit' directive for GLIM.
If `fit' is missing, the default model will be
"v1+v2+...+vn", where `vi' is the ith variate name.
.AG weight=
optional vector of weights for the `$weight' directive for GLIM.
.AG den=
optional vector of  binomial denominators
(i.e., number of trials) when the binomial error
structure is used.
.AG pwr=
the exponent when a power transformation link is used.
.AG names=
optional, vector of variate names to be used in the fit
description.  These must be unique in 4 characters, and be
composed of lower case letters and digits (GLIM
restrictions).  If omitted, names "v1", "v2", ... will be
used.
The `y', `den', `weight' and `offset' variates are always named "y"
"den", "wght" and "offs" respectively.
.PP
.RT
a data structure describing the result of the GLIM fit.  The
arguments `error' through `pwr' are included in the results,
along with the following.
.RC dv
the value of the deviance (likelihood ratio goodness-of-fit) for the model.
.RC x2
the value of the "chi-squared" statistic for the model (see the GLIM Manual
for a definition).
.RC coef
the coefficients of the fitted model.  Note that in the case
of overdetermined models (e.g., analysis of variance) the
parameterization is that of GLIM, which is somewhat
unconventional in that it makes the first effect zero,
rather than the sum of the effects.
Parameters set to zero in this way are returned as missing values,
to distinguish them from honest zeros.
.RC fv
the vector of fitted values from the model.
.RC vc
the variance-covariance matrix.
.RC vl
the variances of the linear predictors.
.RC wt
the final values of the weights produced by GLIM's iteratively
reweighted least-squares algorithm.
.RC labels
the labels associated with the elements of `coef', made up
of the names of the variables or categories, plus level
numbers for the categories.
.RC names
the names of the variables passed to GLIM as predictors.
.PP
Files `glim.in' and `glim.ot' will appear in the current directory.
These files contain the input given to GLIM and the resulting GLIM output.
It may be necessary to look at these files to track down problems.
.EX
glim(a,b,c,resp,error="p",weight=w1,fit="v1*v2+v3")
.SH REFERENCES
Baker, R. J. and Nelder, J. A. (1978) `The GLIM System'.
Numerical Algorithms Group, 7 Banbury Road, Oxford, ENGLAND.
.PP
Chambers, J. M. and Schilling, J. M. (1981).
"S and GLIM: an experiment in interfacing statistical systems",
`Bell Laboratories Memorandum'.
.SH BACKGROUND
For those unfamiliar with GLIM, the following brief description may
get things started. For detailed documentation see the first
reference.
.PP
GLIM fits linear models, in which the linear predictor may be
transformed by the `link' function before being used to predict y.
The predictor is the sum of linear terms of real and categorical
variables.
The dependent variable is assumed to come from one of a standard 
set of distributions. 
If the predictor
variates are categories (`factors' in GLIM terminology), the GLIM
model will contain coefficients for all the levels of the factor
(except the first; see coef above).
If more than one factor appears in
the model, one may choose to fit also the `interaction' terms for two
or more factors.
.PP
The model used by GLIM is controlled by the `fit' directive.  This is
a character string containing the names of the factors, joined by "+",
"*" and ".".  These operators mean to include both the terms, both the
terms plus the interactions and only the interaction, respectively.
.PP
The commonly useful GLIM models are:
the analysis of variance (default `error' and `link');
loglinear models for contingency tables `(error="p",link="l")';
logit and probit models `(error="b",den=...,link="g"' and `link="p"').
Other values for the `link' argument are "r" (reciprocal), "c"
(complementary log-log), "s" (square root), "i" (identity) and
"e" (exponent).  The error may be "n" (normal), "p"
(poisson), "b" (binomial) or "g" (gamma); when binomial errors are
used, the denominator vector must be provided as `den'.  See the GLIM
Manual for interpretation of all this, for examples and for the use of
the `weight' and `offset' directives.
.KW basic*
.WR
//GO.SYSIN DD *
cat >glim.i <<'//GO.SYSIN DD *'
#S function to construct directives for the glim system.
# user of S function provides as many data sets as desired, plus
# optional settings for the various glim model-defining directives.
# V 1.20 of 82/04/16

#	length of GLIM's input line - must correspond to length
#	assigned to variable ncir5 in setup.f

define(`GLIMWIDTH',`120')

#	symbol used by GLIM to denote end-of-record
#	used here to keep GLIM from spending inordinate amounts of time
#	scanning long lines
#	definition must correspond to initialization of variable eprsym
#	in setup.f

define(`EPRSYM',`!')

#	to make string comparisons easier to read by cutting out TEXT
#	and TSTRING glop

define(`STREQ',`streq(TEXT($1),TSTRING($2))')

FUNCTION glim(
	y=	/REAL,OPTIONAL/
	error=	/CHAR,1/
	link=	/CHAR,1/
	offset=	/REAL,OPTIONAL/
	fit=	/CHAR,1,STRING(%gm)/
	weight=	/REAL,OPTIONAL/
	den=	/REAL,OPTIONAL/
	pwr=	/REAL,1/
	names=	/CHAR,OPTIONAL/
	&
	)

# count variates (vectors + columns of matrices) for names dimension
nvar=4	# allow for optional weight, offset, denominator, and y 
ALLARG(variate)
	COERCE(variate/MATRIX,NAFATAL/)
	nvar=nvar+NCOL(variate)
NEXTARG

STRUCTURE( pdata/INT,nvar/, nlev/INT,nvar/, vnames/CHAR,nvar/)
STRUCTURE( ldata/INT,nvar/ )
STATIC( integer otfile,sattac,infile,nvar,nobs,nfac,i,nwt,noff,nden )
STATIC( integer digit,fitlow,fitlen,namlen )
STATIC( POINTER glname,dsname )
STATIC( CHARACTER(jgetch,1);CHARACTER(ch,1);logical streq,letter )

nvar=0; nfac=0
# check whether args are factors, and get names and values
ALLARG(variate)
	COERCE(variate/MATRIX,NAFATAL/)
	STRUCTURE(Label/FROM(variate),CHAR,OPTIONAL/)
	for(j=0;j<NCOL(variate);j=j+1){
		nvar=nvar+1
		pdata[nvar]=VALUE(variate)+(j)*NROW(variate)
		ldata[nvar]=NROW(variate)

		if(!MISSING(Label)) { # a factor
			nlev[nvar]=LENGTH(Label)
			nfac=nfac+1
			}
		else nlev[nvar]=0

		if(MISSING(names) || (nvar > LENGTH(names) ))
		{ # make up names if not given
			ENCODE("v",I(nvar,0))
			vnames[nvar]=istrng(BUFFER,BUFPOS)
			CLEAR
		}
		else vnames[nvar]=names[nvar]
		}
NEXTARG
if(!MISSING(y)) {
	nvar=nvar+1;pdata[nvar]=VALUE(y)
	ldata[nvar]=LENGTH(y)
	vnames[nvar]=STRING(y)
}
nyvar=nvar
if(nyvar<1) FATAL(No data supplied)

#	put in weight, offset and binomial denominator as the next variables

if(!MISSING(weight)) { #put in as next variable
	nvar=nvar+1; nwt=nvar; pdata[nvar]=VALUE(weight);vnames[nvar]=STRING(wght)
	ldata[nvar]=LENGTH(weight)
	}

if(!MISSING(offset)) { #put in as next to last variable
	nvar=nvar+1; noff=nvar; pdata[nvar]=VALUE(offset); vnames[nvar]=STRING(offs)
	ldata[nvar]=LENGTH(offset)
	}

if(!MISSING(den)) { #put in as last variable
	nvar=nvar+1; nden=nvar; pdata[nvar]=VALUE(den); vnames[nvar]=STRING(den)
	ldata[nvar] = LENGTH(den)
	}

nobs = ldata[1]
for(i=2; i<=nvar; i=i+1) { # check that everything has the same length
	if( ldata[i] != nobs )
		FATAL(all variables must have the same number of observations)
	}

otfile=sattac(TSTRING(glim.in),WRITE,AUTO) #file for glim directives, data

FPRINT(otfile,"$units ",I(nobs),"EPRSYM")

if(nfac>0) {	#print FACTOR directive
	FPRINT(otfile,"$factor","EPRSYM")
	do i=1,nvar
		if(nlev[i]>0)
			FPRINT(otfile,C(TEXT(vnames[i])),I(nlev[i]),"EPRSYM")
	}

	# print $data directive giving names of variables to
	# be initialized

FPRINT(otfile,"$data ",I(nobs),"EPRSYM" )
do i=1,nvar
	FPRINT(otfile,C(TEXT(vnames[i])),"EPRSYM" )

	# now set up read directive, followed by the data

FPRINT(otfile,"$read","EPRSYM")
for(i=0; i<nobs; i=i+1) { #write out the data, by observation
	for(j=1; j<=nvar; j=j+1) {
		if(BUFPOS>GLIMWIDTH-FWIDTH)FPRINT(otfile)
		ENCODE( R(rs(pdata[j]+i)))
		}
	FPRINT(otfile,"EPRSYM")
	}
FPRINT(otfile,"$yvar ",C(TEXT(vnames[nyvar])),"EPRSYM")

	# set error, if required.  For binomial error, check that
	# a denominator vector is present

if(!MISSING(error)){
	if(MISSING(den)) {
		if(STREQ(error,b))
			FATAL(must supply denominator vector)
		else{
			FPRINT(otfile,"$error ",C(TEXT(error)),"EPRSYM")
			RETURN(error,&)
		}
	}
	else{
		if(!STREQ(error,b))
			FATAL(denominator vector allowed only for binomial models)
		else{
			FPRINT(otfile,"$error ",C(TEXT(error))," ",
				C(TEXT(vnames[nden])),"EPRSYM")
			RETURN(error,den,&)
		}
	}
}

	# set link, if required.  If possible, check for consistency with
	# error.

if(!MISSING(link)){
	if(MISSING(pwr)){
		if(STREQ(link,e))
			FATAL(must supply power)
		else{
			FPRINT(otfile,"$link ",C(TEXT(link)),"EPRSYM")
			RETURN(link,&)
		}
	}
	else{
		if(!STREQ(link,e))
			FATAL(power allowed only for exponent link)
		else {
			FPRINT(otfile,"$link ",C(TEXT(link))," ",R(pwr),
				"EPRSYM")
			RETURN(link,pwr,&)
		}
	}

		# check error/link validity

	if(!MISSING(error)) {
		if( STREQ(error,b) ) {
			if(!( STREQ(link,g) || STREQ(link,p) ||
				 STREQ(link,c)
			)) {
				FATAL(link not valid for binomial model)
			}
		} else {
			# doing the following is messier than flagging only
			#	g p and c links, but it ensures we catch garbage
			if(!( 
				STREQ(link,i) || STREQ(link,l) ||
				STREQ(link,r) || STREQ(link,s) ||
				STREQ(link,e)
			))
				FATAL(link not valid for non-binomial model)
		}
	}
}

	# output $weight and $offset directives if required

if(!MISSING(weight)){FPRINT(otfile,"$weight ",C(TEXT(vnames[nwt])),"EPRSYM")
			RETURN(weight,&)}

if(!MISSING(offset)){FPRINT(otfile,"$offset ",C(TEXT(vnames[noff])),"EPRSYM")
			RETURN(offset,&)}

	# output $fit directive if required

if(MISSING(fit) & nyvar>1) {	# if vars, fit them all

	#	first, determine length of model string
	#	since model string may be more than 256 characters
	#	long, don't use the buffer for this

	fitlen = 0
	for( i=1; i<nyvar; i=i+1 ) {
		fitlen = fitlen + islenz(TEXT(vnames[i])) + 1
		}
	fit = jstkgt(fitlen,CHAR)

	#	now construct string

	fitlen = 0
	for(i=1; i<nyvar; i=i+1) {
		namlen = islenz(TEXT(vnames[i]))
		call concat(TEXT(fit), fitlen+1, TEXT(vnames[i]), 1,
				namlen )
		fitlen = fitlen + namlen + 1
		call concat(TEXT(fit), fitlen, "+", 1, 1)
		}
	call concat(TEXT(fit), fitlen, "EOS", 1, 1) # this wipes out
						    # hanging "+"
	}
CLEAR
FPRINT(otfile,"$fit","EPRSYM")

#	now write model out.  Since model may be more than GLIMWIDTH
#	characters long, print it in chunks

fitlen = islenz(TEXT(fit))
fitlow = 1
while( (fitlen - fitlow) >= GLIMWIDTH )
{
	#	more than GLIMWIDTH characters - look for something
	#	that is not an identifier as place to split string

	i = fitlow + GLIMWIDTH - 1
	ch = jgetch(TEXT(fit),i)
	while( letter(ch) || (digit(ch) >= 0) )
	{
		#	identifier (letter or digit) - back up one
		#	but don't back up over end of string

		i = i - 1
		if( i < fitlow )
			FATAL(Identifier more than GLIMWIDTH characters long)

		ch = jgetch(TEXT(fit),i)
	}
	#	print a chunk

	call concat(BUFFER, BUFPOS+1, TEXT(fit), fitlow, i-fitlow+1)
	BUFPOS = BUFPOS + i - fitlow + 1
	FPRINT(otfile)
	fitlow = i + 1
}

#	write out last chunk of text

call concat(BUFFER, BUFPOS+1, TEXT(fit), fitlow, (fitlen-fitlow+1) )
BUFPOS = BUFPOS + (fitlen-fitlow+1)
FPRINT(otfile)
RETURN(fit,&)

if(nyvar>1) {LENGTH(vnames)=nyvar-1;RETURN(names=vnames,&)} # names retd if vars given

call glstdr(otfile) #print standard glim directives
call sdetac(otfile)
call excmd(TSTRING(S glim <glim.in >glim.ot),0) #exec GLIM
CHAIN(glim2)
END
//GO.SYSIN DD *
cat >glim2.i <<'//GO.SYSIN DD *'
#S function to read the output file from the S glim
# function, which will contain formatted output from
# the glim program, and make S structures from it
# to return to the user.
# V 1.11 of 82/04/16
FUNCTION glim2(&)
INCLUDE(read,attach)
STATIC(CHARACTER(beglin,40))
STATIC(CHARACTER(jgetch,1))
STATIC(logical streq;integer nu,pl,ml,sattac,infile)
STATIC(real dev,chisq,rdf,rnu,rpl,rml)
infile=sattac(TSTRING(glim.ot),READ,AUTO)
repeat{
	GETLINE(infile)
	if(EOF) FATAL(probable GLIM error: see file glim.ot)
	if(LASTCH<7) next
	DECODE(S(beglin,40))
	do i=2,6{ #check for GLIM errors
		if(jgetch(beglin,i)!="*") break
		}
	if(i==6){ # copy the GLIM  messages
		EPRINT(C(INBUF,LASTCH))
		GETLINE(infile)
		EPRINT(C(INBUF,LASTCH))
		}
	if(streq(beglin,TSTRING(*output)))break
	}
 
#	extract information output by print statements in glstdr.r

FREAD(infile,R(rnu),R(rdf),R(rpl),R(rml),R(dev),R(chisq))
nu=rnu; pl=rpl
STRUCTURE(dv/REAL,1,dev/,
	x2/REAL,1,chisq/,
	df/INTEGER,1/,
	coef/REAL,pl/,
	fv/REAL,nu/,
	vl/REAL,nu/,
	wt/REAL,nu/,
	vc/MATRIX,pl,pl/,
	labels/CHAR,pl/)
df = rdf
do i=1,nu
	DECODE(R(fv[i]))
do i=1,nu
	DECODE(R(wt[i]))
do i=1,nu
	DECODE(R(vl[i]))
do j=1,pl{
	do i=1,j
	DECODE(R(vc[j,i]))
	}
do j=2,pl{
	do i=1,j-1
	vc[i,j]=vc[j,i]
	}
# fill the vector containing names of coefficients, and coefficient
# estimates.  First, look for the work "estimate" in the output
repeat{
	GETLINE(infile)
	if(EOF)FATAL(probable GLIM error: see file glim.ot)
	if(LASTCH<6)next
	DECODE(S(beglin,40))
	if(streq(beglin,TSTRING(estimate)))break
	}
for(i=1; i<=pl; i=i+1) {
	GETLINE(infile)

		# get value of coefficient

	AUTO_CHECK = FALSE
	DECODE(I(idummy),R(coef[i]))
	if(EOF)FATAL(Missing parameter estimates - Probable GLIM Error: see file glim.ot)

	if(FIELD_ERROR){
			# if we couldn't read the value of the coefficient
			# assume that it was aliased

		NASET(coef[i])
		AUTO_CHECK = TRUE
		FIELD_ERROR = FALSE
		DECODE(S(beglin,10),S(beglin,10))	#eat words zero and aliased
	}
	else
	{
		AUTO_CHECK = TRUE
		DECODE(R(rdummy))	# throw away standard error
	}
	if(EOF)FATAL(Probable GLIM Error: see file glim.ot)

	DECODE(S(beglin,40))		# get name of parameter

	if(EOF)FATAL(Probable GLIM Error: see file glim.ot)
	if(streq(beglin,TSTRING(%gm)))labels[i]=istrng("intercept",9)
	else labels[i]=istrng(beglin,-1)
	}
call sdetac(infile)
RETURN(dv,df,x2,coef,fv,vc,vl,wt,labels,FILTER)
END
//GO.SYSIN DD *
cat >glstdr.r <<'//GO.SYSIN DD *'
ROUTINE(glstdr,output a standard set of GLIM directives)
subroutine glstdr(otfile)
integer otfile
#these are the canned glim directives which allow the S function
#glim2 to read back the results of a glim fit
# V 1.6 of 82/04/16

INCLUDE(print)

define(`LENGTH',50)define(`NLINE',4)
CHARACTER(dirct,LENGTH,NLINE)

data dirct/"$ext %vc %vl $print '*output'",
"$print *9 %nu %df %pl %ml %dv %x2 %fv %wt %vl %vc",
"$dis e", "$stop"/

do i = 1,NLINE
	FPRINT(otfile,C(dirct(i),LENGTH))
return
end
//GO.SYSIN DD *
-- 

Doug Bates @ wisconsin
ARPA:	bates@wisc-stat.arpa
	bates@uwisc   (if you have the old host table)
UUCP:	...!{allegra,heurikon,ihnp4,seismo,sfwin,ucbvax,uwm-evax}!uwvax!bates