[net.math.stat] Some topics I wouldn't mind discussing

jt@nrcvax.UUCP (Jerry Toporek) (09/20/85)

This newsgroup has been rather quiet.  If there are folks out there,
I'd like to hear what you are doing, or would like to be doing, with
regard to a number of topics related to statistical software.  This
isn't a survey, so don't feel obliged to answer everything.  Pick out
whatever seems important to you.  Here goes:

Are you generally happy with the available statistical software in your
computing environment?  Are UNIX people using S?  Is it really what you
want, and, if so, for what types of applications?  What else is being used
in the UNIX world?  

Are your data management tools adequate?  Do they provide the kind of 
operating environment you want?  Do data analysts still basically prepare
commands and submit them to a background process, or do they prefer some kind
of interactive operation?

How are the statistical packages performing on the IBM PC?  Do you prefer the
older, bigger, major statistical packages which have been made to run on the
PC, or the newer packages produced specifically to run in the small machine
environment?  Is there a package which combines the best features of both
types of package?  What are those features?

Are people starting to use smaller machines for local computing and large
machines for data storage?  Are there tools available to support distributed
computing and data management?  Do you want them?

Let me interrupt this line of questioning to say that my interest in all this
stems from the fact that most of my professional career has been spent 
developing statistical software, but the past year has been spent entirely
in development of networking software.  The switch came, in part, from a 
belief that statistical software of the future will be built on top of tools
providing access to resources within a network environment.  Data storage
on the machine with the big disks, number crunching from the array processor,
data collection direct from the lab equipment or production line sensors,
print service down the hall on the machine with the laser printer, etc. etc.,
all at my disposal on my little machine under my desk which couldn't hope to
do all that by itself.

Anyone else think this is the way to go?  Are we still recovering from the
dramatic shortages of card readers?  Enough for now?



-- 
	Jerry Toporek
	{sdcsvax,hplabs}!sdcrdcf!psivax!nrcvax!jt
	ucbvax!calma!nrcvax!jt

perlman@wanginst.UUCP (Gary Perlman) (09/25/85)

I am compelled by unknown forces to do this every year,
I guess because people thank me for it.

Since 1980, I have been distributing a small statistics
package called UNIX|STAT, so called because it was developed on
UNIX and uses pipelines a lot; it is a very UNIX style package.
Thanks to a lot of grungy work by Fred Horan at Cornell,
the Lattice C compiler, and continuing education in portability,
most of the programs have been ported to MSDOS on the IBM PC.
I am not yet ready to distribute the programs on floppies for
MSDOS, but more than one site has been able to take the sources
I distribute and compile them for MSDOS with other C compilers.
Over the next few months, I will be doing V&V work on the MSDOS
versions and finding some floppy-copy house to make copies.

So, what is UNIX|STAT?  Well, it's not comprehensive, but there
are a lot of good programs in it.  They are described below.
More programs are likely in the next year.  Some people have sent
me code (that I have not yet had time to incorporate) for non-
parametrics, and I am working on a multi-factor crosstabs/chi-square.

People seem to like UNIX|STAT because it integrates with UNIX
naturally, reading the standard input and writing the standard output.
It even has documentation: tutorials, manual entries, and I have
even made a video tape introduction (although the tape has not been
distributed with the package).  It is also cheap: $20 gets you a mag tape,
or you can send me a 600 foot mag tape and prepaid return mailer
and get it free.  This, obviously, is public domain software.

If you send me your postal address, I can send you more documentation.
Now for details.

Note: if you are using UNIX|STAT 5.0, there is nothing new here.

                            UNIX|STAT 5.0
                    COMPACT DATA ANALYSIS PROGRAMS


     UNIX|STAT is a set of UNIX System data manipulation and analysis
programs developed at the University of California, San Diego by Gary
Perlman (now teaching at the Wang Institute of Graduate Studies).  The
programs are designed with the UNIX System philosophy that individual
programs should be designed as tools that do one task well and produce
output suitable for input via pipes to other programs.  Interactive
use is supported in the UNIX System shell which also provides a
programming language for complex analyses.  Typical usage involves a
pipeline of transformations of data followed by input to an analysis
program, summarized schematically by:

          INPUT DATA | TRANSFORM | ANALYSIS | OUTPUT RESULTS

Functionality often built into statistical packages (e.g., graphics,
sorting and other data manipulation) is not re-invented in UNIX|STAT
which delegates such responsibility to standard UNIX System tools.
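
The schematic above can be illustrated with a minimal sketch.  It is
written in Python for brevity and is not UNIX|STAT code (the package
itself is C); the function names parse_columns, extract_column,
describe, run_pipeline and filter_stream are all hypothetical.  Each
stage takes the previous stage's output, much as each program in a
pipeline reads the previous program's standard output.

```python
import statistics
import sys

def parse_columns(lines):
    """INPUT DATA stage: split free-format lines into whitespace-separated fields."""
    return [line.split() for line in lines if line.strip()]

def extract_column(rows, index):
    """TRANSFORM stage: pull one column (0-based) and convert it to float."""
    return [float(row[index]) for row in rows]

def describe(values):
    """ANALYSIS stage: compute a few descriptive statistics."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }

def run_pipeline(lines, index):
    """INPUT DATA | TRANSFORM | ANALYSIS, expressed as function composition."""
    return describe(extract_column(parse_columns(lines), index))

def filter_stream(index, infile=None, outfile=None):
    """OUTPUT RESULTS stage: act as a filter from infile to outfile.

    Defaults to standard input and standard output, so the result could
    itself be piped into another tool.
    """
    infile = sys.stdin if infile is None else infile
    outfile = sys.stdout if outfile is None else outfile
    for name, value in run_pipeline(infile, index).items():
        outfile.write(f"{name}\t{value:g}\n")
```

Calling filter_stream(2) from a script would summarize the third
column of whatever arrives on standard input.  Keeping the stages
separate is the point of the design: any one of them can be swapped
for a standard tool without touching the others.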

FEATURES

     easy to use (negligible training period)
     simple input formats (free format field oriented)
     used in pipelines with other UNIX System utilities (sort, vi)
     flexible data manipulation
     data validation provided (range and type checking)
     full documentation support (manual entries, tutorials)
     extensible (many modular C functions)
     faster than most packages (usually less than a second per analysis)
     small enough for micros (10-25K byte programs)
     runs on any UNIX System (V6, V7, 2.8BSD, 4BSD, III.0, System V, others)
     public domain software (can't be distributed for gain)
     in use at more than 300 UNIX System sites for five years
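
As an illustration of what range and type checking on free-format,
field-oriented input might involve, here is a minimal sketch in
Python.  It is not the package's own validation code; the functions
field_type and validate, and the convention of passing ranges as a
dictionary, are assumptions made for this example.  The first
non-blank line is taken as the template for field count and types.

```python
def field_type(field):
    """Classify a field as 'int', 'float', or 'text'."""
    try:
        int(field)
        return "int"
    except ValueError:
        pass
    try:
        float(field)
        return "float"
    except ValueError:
        return "text"

def validate(lines, ranges=None):
    """Return a list of (line_number, message) problems found in lines.

    ranges maps a 0-based column index to (low, high) bounds; this is a
    hypothetical convention chosen for the sketch.
    """
    ranges = ranges or {}
    problems = []
    expected_width = None
    expected_types = None
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # blank lines are ignored
        types = [field_type(f) for f in fields]
        if expected_width is None:
            # The first data line defines the expected layout.
            expected_width, expected_types = len(fields), types
        elif len(fields) != expected_width:
            problems.append(
                (number, f"expected {expected_width} fields, got {len(fields)}"))
            continue  # columns no longer line up, so skip further checks
        else:
            for col, (want, got) in enumerate(zip(expected_types, types)):
                if want != got:
                    problems.append(
                        (number, f"column {col + 1}: expected {want}, got {got}"))
        for col, (low, high) in ranges.items():
            if col < len(fields) and types[col] != "text":
                value = float(fields[col])
                if not low <= value <= high:
                    problems.append(
                        (number,
                         f"column {col + 1}: {value:g} outside [{low:g}, {high:g}]"))
    return problems
```

For example, validate(lines, ranges={2: (0.0, 1.0)}) would report any
line whose third field falls outside 0 to 1.  Note that this sketch
treats 'int' and 'float' as different types, which may be stricter
than a real data set wants.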

CHANGES FOR RELEASE 5.0 (March 5, 1985)

     reworked to increase portability, reliability, and usability
     all commands now use a standard option parser (getopt)
     all calculations are now done in double precision
     diagnostic error messages have been improved
     regress now does a partial correlation analysis
     colex and trans were added as alternatives for dm
     F ratio probabilities are now better approximated
     some inefficient input was optimized
     some non-portable features of C were replaced so that
       the programs now run under MSDOS on the IBM PC
     the random number seeding has been improved
     all programs now use a zero exit status on success
     version control was added--we are now at release 5.0
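
Two of the changes above, a common getopt-style option parser and a
zero exit status on success, are conventions that make programs
easier to combine in shell scripts.  The sketch below shows the
general idea using Python's standard getopt module.  The option
letters -c and -v and the functions parse_options and main are
hypothetical; they are not the options of any UNIX|STAT program.

```python
import getopt
import sys

def parse_options(argv):
    """Parse hypothetical options: -c COLUMN (1-based) and -v (verbose)."""
    options = {"column": 1, "verbose": False}
    opts, args = getopt.getopt(argv, "c:v")
    for flag, value in opts:
        if flag == "-c":
            options["column"] = int(value)
        elif flag == "-v":
            options["verbose"] = True
    return options, args

def main(argv):
    """Return 0 on success and a nonzero status on a usage error."""
    try:
        options, args = parse_options(argv)
    except (getopt.GetoptError, ValueError) as error:
        # Diagnostics go to standard error so they never pollute a pipeline.
        sys.stderr.write(f"usage error: {error}\n")
        return 2
    if options["verbose"]:
        sys.stderr.write(f"column {options['column']}, files {args}\n")
    return 0
```

A script would typically end with sys.exit(main(sys.argv[1:])), so
that the shell can test the status with constructs such as "if" or
"&&".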

UNIX|STAT is Public Domain

     The programs have been released to the public and are distributed
to anyone who wants them.  Persons wanting to get a copy of the
package should contact me directly.  You can get the package for free
if you send me a tape and a self-addressed prepaid return mailer.  Or
you can send me personally $20 US to cover the costs of a tape and mailing.

The distribution includes:

     The C source files for all the programs.
     The documentation source files.
     A collection of test examples.

Contact:

     Gary Perlman
     Wang Institute of Graduate Studies
     Tyng Road
     Tyngsboro, MA 01879 USA
     (617) 649-9731
     uucp:     decvax!wanginst!perlman
               sdcsvax!sdcsla!perlman
     csnet:    perlman@wanginst
     arpa:     sdcsla!perlman@nprdc

NOTES:

     UNIX|STAT is unsupported, though known bugs have been removed.
     UNIX|STAT may not be distributed for profit.
     UNIX|STAT is NOT a product of any company or organization.
     UNIX|STAT is distributed on a ``use-at-your-own-risk'' basis.


UNIX|STAT(1)           UNIX User's Manual            UNIX|STAT(1)

NAME
     UNIX | STAT - compact data analysis programs

DESCRIPTION
     UNIX | STAT is a set of data manipulation and analysis pro-
     grams developed at the University of California, San Diego.
     The programs are designed with the UNIX System philosophy
     that individual programs should be designed as tools that do
     one task well and produce output suitable for input via
     pipes to other programs.  Interactive use is supported in
     the UNIX System shell which also provides a programming
     language for complex analyses.  Functionality often built
     into statistical packages (e.g., graphics, sorting and other
     data manipulation) is not re-invented in UNIX | STAT which
     delegates such responsibility to standard UNIX System tools.

     DATA TRANSFORMATION PROGRAMS
          abut           join data files
          colex          column extraction
          dm             column oriented data manipulator
          io             control and monitor input and output
          maketrix       create matrix type file from free-form file
          perm           randomly permute lines in a file
          repeat         repeat a pattern or file
          reverse        reverse lines and characters
          series         print a series of numbers
          transpose      transpose matrix type file

     ANALYSIS PROGRAMS
          anova          multi-factor anova with repeated measures
          calc           interactive algebraic modeling calculator
          critf/pof      F-ratio/probability conversion functions
          dataplot       flexible data plotting
          desc           descriptions, histograms, frequency tables
          dprime         signal detection d' and beta calculations
          oneway         one-way anova and t-test
          pair           paired data statistics, regression, plots
          regress        multivariate linear regression
          ts             time series analysis and plots
          validata       verify data file consistency
          vincent        time-series comparison

AUTHOR
     Gary Perlman (with the help of several others)

SEE ALSO
     sh(1), sort(1), uniq(1), sed(1), awk(1), grep(1), rm(1),
     cp(1), pr(1), ls(1), mv(1)
-- 
Gary Perlman  Wang Institute  Tyngsboro, MA 01879  (617) 649-9731
UUCP: decvax!wanginst!perlman             CSNET: perlman@wanginst

ronb@natmlab.OZ (Ron Baxter) (09/26/85)

In article <277@nrcvax.UUCP> jt@nrcvax.UUCP (Jerry Toporek) writes:
>
>Are you generally happy with the available statistical software in your
>computing environment?  Are UNIX people using S?  Is it really what you
>want, and, if so, for what types of applications?  What else is being used
>in the UNIX world?  
>
	On our system (4.2 BSD on a Vax 750) the main statistical
	packages available are:

	o  GLIM - an old favourite with good notation and
	   abilities for fitting models, but somewhat messy
	   output and a quirky syntax (e.g. sometimes you need to
	   have a $ at the end of a line to provoke action, and
	   sometimes you don't.)  Only needs a low once-off fee.

	o  MINITAB - users like it because it is easy to use.  It
	   has an annual fee (so is more expensive than GLIM) and
	   is distributed as a binary (so tough if you don't like
	   some of the decisions that have been made for you).

	o  GENSTAT - better than the rest for getting ANOVAs for
	   data from complex designed experiments.  It
	   also has an annual fee (similar price to Minitab). It
	   is not seen as "easy to use", but it is quite powerful.
	   I have done a UNIX conversion of this package and it
	   is available from NAG.

	o  S - need I say more in this group.  It has the best
	   graphics facilities that we have for data analysts.
	   It is this that often gets users started on S, but
	   then they discover it can do more.  The fact that it
	   really is a practical proposition to add your own
	   algorithms in Fortran puts it way ahead of the others
	   which have limits that are more solidly defined.

	These are the main ones, we do have other more
	specialized packages, and libraries such as IMSL.


>Are your data management tools adequate?  Do they provide the kind of 
>operating environment you want?  Do data analysts still basically prepare
>commands and submit them to a background process, or do they prefer some kind
>of interactive operation?
>
	S and MINITAB are largely used interactively.  GENSTAT
	can be but usually isn't (people grew up using this in
	batch mode on CDC machines so ...).  GLIM is somewhere in
	between, being used interactively some of the time.

>Are people starting to use smaller machines for local computing and large
>machines for data storage?  Are there tools available to support distributed
>computing and data management?  Do you want them?
>
	I can see lots of scope for machines like the Microvax
	but we haven't moved far down this path yet.

>......................................  The switch came, in part, from a 
>belief that statistical software of the future will be built on top of tools
>providing access to resources within a network environment.  ............

	I agree that different facilities will be brought
	together by networks.  I also like the idea of this
	personal workstation being my window onto all this.
	However, at this stage I don't see a powerful enough
	workstation at a low enough price to start pushing the
	low-cost VDUs off everyone's desks.

--
Ron Baxter,			ACSNET: ronb@natmlab
CSIRO Div Maths & Stats,	ARPA:   munnari!natmlab.oz!ronb@SEISMO.ARPA
National Measurement Lab.,	UUCP:   ...!seismo!munnari!natmlab.oz!ronb
PO Box 218, Lindfield, NSW,		
Australia, 2070.		PHONE:	+61 2 467 6059

hes@ecsvax.UUCP (Henry Schaffer) (09/26/85)

> 
> This newsgroup has been rather quiet.  
> 
> How are the statistical packages performing on the IBM PC? 
> -- 
> 	Jerry Toporek
  The big news around here is awaiting SAS for the PC.  It should run
on IBM and compatibles - requires 512k, and should take up about 5 Mb on
a hard disk.  It really sounds like a PC/AT is the desired configuration.
It will only be available on a site license basis, with their usual
sizeable discount for educational institutions.  It will come out in
several parts, and should be very compatible with the mainframe version.
--henry schaffer