[comp.unix.questions] random numbers in awk?

d3e608%pnli.pnl.gov@pnlg.pnl.gov (02/15/91)

I'm trying to write a csh script to extract a random multiline 
record from a file.  Awk does a nice job of getting the records.  Now how
do I choose one randomly?

Thanks,
Jeff Nye, K6-10            email:  d3e608%pnli.pnl.gov@pnlg.pnl.gov 
Battelle PNL               voice:  (509) 376-4571, days, PDT 
PO Box 999                 Richland, WA 99352         

jik@athena.mit.edu (Jonathan I. Kamens) (04/24/91)

In article <26019@adm.brl.mil>, d3e608%pnli.pnl.gov@pnlg.pnl.gov writes:
|> I'm trying to write a csh script to extract a random multiline 
|> record from a file.  Awk does a nice job of getting the records.  Now how
|> do I choose one randomly?

  The standard awk has no built-in way of getting a random number.

  However, the FSF version of awk, gawk, has a built-in rand() function that
returns a random number between 0 and 1.  You can multiply the return value by
the number of records to pick one of them randomly.

  If installing and using gawk is not an option, then another possibility is
to use the jot(1) program to generate a random number, if your system has jot.
You would call "jot -r 1 max" where "max" is the number of records from which
you are selecting.  Either call this in your script and then pass the result
into awk, or call it inside awk if you've got a version of awk that'll let you
call a program and pipe the results into getline.  If you don't have jot, you
can get it from a number of different ftp sites.  I found the source to it in
/help/dist/jot.c on lilac.berkeley.edu, and the man page in
/pub/Library/Computer/doc.4.3/ucs/jot.1 on ocf.berkeley.edu, using archie (if
you don't know how to use archie, see the article mentioned in the
instructions appended to the end of this message).

  Finally, if you've got access to a version of date(1) that will output the
date as a number of seconds since the beginning of Unix time, then you can
just take that number of seconds as the seed for a random number generator
written inside awk.  Your generator can be as symbol for taking the time
modulus the number of records from which you're choosing, or as complex as one
of the good random number generators mentioned in Knuth.

  There are probably other ways to get a random number, and other people will
point them out :-).

  Of course, if you were using perl instead of awk, you wouldn't need to ask
this question at all, since perl has a built-in function for getting random
numbers, and will have that function on any machine that supports perl.

-- 
Jonathan Kamens			              USnail:
MIT Project Athena				11 Ashford Terrace
jik@Athena.MIT.EDU				Allston, MA  02134
Office: 617-253-8085			      Home: 617-782-0710
-- 
        Subject: Information about finding sources (READ THIS BEFORE POSTING)
        Newsgroups: comp.sources.wanted,alt.sources.wanted

        Available via anonymous ftp from pit-manager.mit.edu (18.72.1.58)
        in the file

        /pub/usenet/comp.sources.wanted/Information_about_finding_sources_(READ_THIS_BEFORE_POSTING)

        Available from mail-server@pit-manager.mit.edu by sending a message
        containing

        send usenet/comp.sources.wanted/Information_about_finding_sources_(READ_THIS_BEFORE_POSTING)

        Send a message containing "help" to get general information about the
        mail server.

jik@athena.mit.edu (Jonathan I. Kamens) (04/24/91)

In article <1991Apr24.041134.14519@athena.mit.edu>, I write:
|> You would call "jot -r 1 max" where "max" is the number of records from which
|> you are selecting.

I should have said "jot -r 1 1 max", rather than "jot -r 1 max".  The former
tells jot to produce 1 number between 1 and max; the latter tells jot to
produce 1 number between max and 100, which is the default maximum value.

-- 
Jonathan Kamens			              USnail:
MIT Project Athena				11 Ashford Terrace
jik@Athena.MIT.EDU				Allston, MA  02134
Office: 617-253-8085			      Home: 617-782-0710

urban@cbnewsl.att.com (john.urban) (04/24/91)

> article <26019@adm.brl.mil>, d3e608%pnli.pnl.gov@pnlg.pnl.gov writes:
> I'm trying to write a csh script to extract a random multiline 
> record from a file.  Awk does a nice job of getting the records.  Now how

Several version of UNIX System V have nawk (new awk).  Regular awk does not
have the rand() function however nawk does.

To see if you version of UNIX has nawk, type in:

$ type nawk

If you've got nawk, then you can do something like:
	nawk 'BEGIN {print rand()}' < /dev/null
to get a random number on the screen

UNIX System V/386 Release 3.2 (and higher) also have the random(1) command
[Part of the XENIX System V and UNIX System V/386 merged product]

So you could use: random(1) although it doesn't appear to be total random.

I tried:
$ i=100
$ while [ $i -gt 0 ]
> do
> random 10
> i=`expr $i - 1`
> done | sort -n | uniq -c

  25 0
  10 4
  38 5
   3 6
  24 10


Sincerely,

John Urban

oz@yunexus.yorku.ca (Ozan Yigit) (04/24/91)

In article <1991Apr24.041134.14519@athena.mit.edu>
jik@athena.mit.edu (Jonathan I. Kamens) writes:

>  The standard awk has no built-in way of getting a random number.

You mean the "old" awk. The standard awk (sometimes known as nawk, and is
a part of most up-to-date UN*X distributions these days) as documented in
TheBook, has rand(). Accept no substitutes. ;-)

>  There are probably other ways to get a random number, and other people will
>point them out :-).

Sure, like one can program a good random number generator, even in old
awk. Here is the integer minimal standard random number generator[1] in
old awk. It has been tested for correctness.

BEGIN {
	a = 16807
	m = 2147483647
	q = 127773	# m div a
	r = 2836	# m mod a
	seed = 123	# use awk -f ... seed=<whatever>
}

{
	test = a * (seed % q) - r * int(seed / q);
	if (test > 0)
		seed = test;
	else
		seed = test + m;
	print rand = seed
}

>  Of course, if you were using perl instead of awk ...

But, perl [or c, icon, python ...] is *not* awk. ;-) 

oz
---
[1] Park, Stephen K. and Keith W. Miller, ``Random Number Generators:
Good Ones are Hard to Find'', Communications of the ACM 31 (10), 1988
pp. 1192-1201
---
A pencil is so far the most effective software | internet: oz@nexus.yorku.ca
design tool I have found...   -- Amanda Walker | uucp: utai/utzoo!yunexus!oz

nelson@berlioz.nsc.com (Taed Nelson) (04/25/91)

>  The standard awk has no built-in way of getting a random number.
>
>  However, the FSF version of awk, gawk, has a built-in rand() function that

The current version (1986) of AWK has rand() as well.  On the SunOS, it has
been renamed to be NAWK (New AWK).

guy@auspex.auspex.com (Guy Harris) (04/27/91)

>The current version (1986) of AWK has rand() as well.  On the SunOS, it has
>been renamed to be NAWK (New AWK).

SunOS didn't rename it; that name came from System V Release 3.1, which
offers the "old AWK" as "awk" and "oawk", and the "new AWK" as "nawk". 
The theory was, I guess, that people with scripts that depended on the
"old AWK" should change them to use "oawk", so that in the future "awk"
could become the "new AWK".  However, according to the S5R4
documentation, "awk" is still the old AWK....

gwyn@smoke.brl.mil (Doug Gwyn) (04/29/91)

In article <7453@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
>The theory was, I guess, that people with scripts that depended on the
>"old AWK" should change them to use "oawk", so that in the future "awk"
>could become the "new AWK".  However, according to the S5R4
>documentation, "awk" is still the old AWK....

We did this sort of thing with C compilers, too.  The idea is:
	PHASE 0 (existing practice):  All applications use "awk".
	PHASE 1:  New awk installed as "nawk", old awk remains as "awk",
		"oawk" linked to "awk".
	PHASE 2 (transition):  Applications that require the new
		behavior invoke "nawk".  Old applications are tested
		against "nawk" and if they really do require the old
		version, they are changed to invoke "oawk"; otherwise
		they continue to invoke "awk".
	PHASE 3: "awk" is changed to be a link to "nawk".
	PHASE 4 (catch-up):  Applications that rely on "oawk"-specific
		behavior are updated to work with regular (new) "awk".

martin@mwtech.UUCP (Martin Weitzel) (05/09/91)

In article <1991Apr24.041134.14519@athena.mit.edu> jik@athena.mit.edu (Jonathan I. Kamens) writes:
>
>  [...] another possibility is
>to use the jot(1) program to generate a random number, if your system has jot.

SysV User's should note that they have a similar program, called random.
It's range of output values is restricted to 0..255 (the upper limit must
be specified as a calling argument). If you need larger numbers, you
can, of course, call random twice, multiply the result of the first
call by 256, add the result of the second call, and take the whole thing
modulo the largest number you want. 
-- 
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83

suhre@shark.dsd.trw.com (Maurice E. Suhre) (05/14/91)

In article <1136@mwtech.UUCP> martin@mwtech.UUCP (Martin Weitzel) writes:
+In article <1991Apr24.041134.14519@athena.mit.edu> jik@athena.mit.edu (Jonathan I. Kamens) writes:
+>
+>  [...] another possibility is
+>to use the jot(1) program to generate a random number, if your system has jot.
+
+SysV User's should note that they have a similar program, called random.
+It's range of output values is restricted to 0..255 (the upper limit must
+be specified as a calling argument). If you need larger numbers, you
+can, of course, call random twice, multiply the result of the first
+call by 256, add the result of the second call, and take the whole thing
+modulo the largest number you want. 
+-- 
+Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83

Be advised that the operations mentioned (multiplication, modulo) will,
in general, change the distribution of the random numbers.  That is, if you
started with uniformly distributed random numbers, then what you will
come out with will not be uniformly distributed.

-- 
Maurice Suhre
suhre@trwrb.dsd.trw.com

guy@auspex.auspex.com (Guy Harris) (05/15/91)

>SysV User's should note that they have a similar program, called random.

Well, *some* S5 users, anyway.  As I remember, it was in "/usr/games" in
S3 or in early S5; the games appear to have gone away in a later S5
release, alas.

>It's range of output values is restricted to 0..255 (the upper limit must
>be specified as a calling argument).

If you're thinking of the same "random" I'm thinking of (I'm thinking of
the one I picked up from some S5 release and put into SunOS), that's
only the case if you use the "-e" flag to get it to return a random exit
status.  Its "normal" mode is to act as a filter, randomly choosing
which lines of the standard input to write to the standard output....

gwyn@smoke.brl.mil (Doug Gwyn) (05/15/91)

In article <1991May13.221822.27731@gumby.dsd.TRW.COM> suhre@shark.dsd.trw.com.UUCP (Maurice E. Suhre) writes:
-In article <1136@mwtech.UUCP> martin@mwtech.UUCP (Martin Weitzel) writes:
-+If you need larger numbers, you can, of course, call random twice, multiply
-+the result of the first call by 256, add the result of the second call, and
-+take the whole thing modulo the largest number you want. 
-Be advised that the operations mentioned (multiplication, modulo) will,
-in general, change the distribution of the random numbers.  That is, if you
-started with uniformly distributed random numbers, then what you will
-come out with will not be uniformly distributed.

Be advised that the procedure Martin suggested does preserve uniformity.

suhre@ondine.dsd.trw.com (Maurice E. Suhre) (05/16/91)

In article <16154@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
+In article <1991May13.221822.27731@gumby.dsd.TRW.COM> suhre@shark.dsd.trw.com.UUCP (Maurice E. Suhre) writes:
+-In article <1136@mwtech.UUCP> martin@mwtech.UUCP (Martin Weitzel) writes:
+-+If you need larger numbers, you can, of course, call random twice, multiply
+-+the result of the first call by 256, add the result of the second call, and
+-+take the whole thing modulo the largest number you want. 
+-Be advised that the operations mentioned (multiplication, modulo) will,
+-in general, change the distribution of the random numbers.  That is, if you
+-started with uniformly distributed random numbers, then what you will
+-come out with will not be uniformly distributed.
+
+Be advised that the procedure Martin suggested does preserve uniformity.

There seems to have been a shortage of correct analysis on my part, and
perhaps elsewhere.  Yes, the multiplication as described above does retain
the property of uniformity of distribution (assuming that it was uniform
going in).  No, the modulo strategy does not preserve uniformity.  An
example for those who are getting confused or have misunderstood:

Suppose that you have a uniform random number generator which generates
the integers 0, 1, 2,...,9.  Then we have p(0) = p(1) ... p(9) = 0.1
Now, suppose we take these numbers modulo 7.  Then the numbers 7, 8, and 9
are mapped into 0, 1, and 2 respectively.  The new probabilities become
p(0) = p(1) = p(2) = 0.2, and p(3) ... p(6) = 0.1 as before.  Any modulus
operation other than by 2 will result in a non-uniform distribution.  The
solution is to scale all numbers by the same factor, either compressing 
or expanding the range.  If integers are needed, then you could toss
out any random numbers which are outside the range of interest.
-- 
Maurice Suhre
suhre@trwrb.dsd.trw.com

dfenyes@thesis1.med.uth.tmc.edu (David Fenyes) (05/23/91)

In article <1991May13.221822.27731@gumby.dsd.TRW.COM> suhre@shark.dsd.trw.com.UUCP (Maurice E. Suhre) writes:
>In article <1136@mwtech.UUCP> martin@mwtech.UUCP (Martin Weitzel) writes:

>+SysV User's should note that they have a similar program, called random.
>+It's range of output values is restricted to 0..255 (the upper limit must
>+be specified as a calling argument). If you need larger numbers, you
>+can, of course, call random twice, multiply the result of the first
>+call by 256, add the result of the second call, and take the whole thing
>+modulo the largest number you want. 

>Be advised that the operations mentioned (multiplication, modulo) will,
>in general, change the distribution of the random numbers.  That is, if you
>started with uniformly distributed random numbers, then what you will
>come out with will not be uniformly distributed.

The dist. will be changed-- from uniform on 0...255 to uniform
on 0...65535.

The multiplication 'spreads' the pdf of the first call form 1/256 in each
'bin' from 0...255 to 1/256 in every 256'th bin from 0 to 65280.

Addition of the second number convolves its pdf (0..255) with the first,
'filling out' the summed pdf to a unform distribution of 1/65536 from
0...65535.

If the modulo is a power of 2, then the distribution of the pdf below
the mod value will be uniform.  (The distribution above the value will
be perfectly mapped to the dist. below so that each 'bin' gets the same
amount of added probability).  If the mod value is NOT a power of 2, so
that 633536 MOD n != 0, then the dist will be:

	from 0 to 65536 % n - 1: ([65536 / n] + 1) / 65536
	from 65536 % n up: ([63336 / n]) / 65536

where [x] means the greatest integer <= x.


Regards,

David
---------
David Fenyes                                 dfenyes@thesis1.med.uth.tmc.edu
University of Texas Medical School           Houston, Texas