[net.sources] A Usenet article generator

jbuck@epimass.UUCP (02/18/87)

#! /bin/sh
# This is a shell archive, meaning:
# 1. Remove everything above the #! /bin/sh line.
# 2. Save the resulting text in a file.
# 3. Execute the file with /bin/sh (not csh) to create the files:
#	README
#	Makefile
#	markov3.l
#	markov3.6
#	getopt.c
# This archive created: Tue Feb 17 20:38:40 1987
export PATH; PATH=/bin:$PATH
if test -f 'README'
then
	echo shar: will not over-write existing file "'README'"
else
cat << \SHAR_EOF > 'README'
I created a bit of a stir with this program in December 1986 when I
used an earlier version of it to simulate a certain well-known net
personality (Hi Laura!).  It digests Usenet articles and spits out
other articles with similar characteristics.  You need lex to run it,
but otherwise it should run on any Unix I know of.  

I had several requests for the program but didn't consider it
"ready".  It's as ready as it will ever be now.

The program uses getopt(3).  There are several public-domain versions
available for Berkeley systems in the mod.sources archives.  Since
it's small, I've included Henry Spencer's version, but you'll have
to change the Makefile (uncomment the GOPT=getopt.o line) to use it.

For best results, feed it at least ten articles by the same person
or on the same subject.  With fewer articles the output resembles
the originals too closely; with too much variety in the input the
output is even more incoherent than usual.
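
For example, if you've saved some articles in files named art1 through
art10 (the names here are just for illustration), something like

	markov3 -n 3 art* > fake.articles

should write three generated articles to fake.articles.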

The program requires lots of memory if it is given lots of input;
the small-model people will have problems.

Please don't post the output to the net (though I'd be happy to
see some of the more interesting results).

Send comments, suggestions for improvement, fan mail, and flames
to me: {sun,hplabs,ames,ihnp4}!oliveb!epimass!jbuck.
SHAR_EOF
fi # end of overwriting check
if test -f 'Makefile'
then
	echo shar: will not over-write existing file "'Makefile'"
else
cat << \SHAR_EOF > 'Makefile'
CFLAGS=-O

GOPT=
# BSD people: uncomment the following line
# GOPT=getopt.o
markov3: markov3.o $(GOPT)
	cc $(CFLAGS) markov3.o $(GOPT) -o markov3

markov3.c:	markov3.l
		lex markov3.l
		mv lex.yy.c markov3.c

SHAR_EOF
fi # end of overwriting check
if test -f 'markov3.l'
then
	echo shar: will not over-write existing file "'markov3.l'"
else
cat << \SHAR_EOF > 'markov3.l'
%{
/*
 * Copyright (c) 1986 by Joe Buck
 *
 * Permission is granted to use, redistribute, or modify this program,
 * as long as you don't pretend you wrote it.  Send improvements or
 * bug reports to {ihnp4,hplabs,sun}!oliveb!epimass!jbuck.
 *
 * The program generates simulated Usenet articles, given Usenet articles
 * as input.
 *
 * This program constructs a table of frequencies for a token appearing,
 * given the two preceding tokens.  A "token" is a sequence of non-blank
 * characters.  An entirely blank line is also treated as a token, as is
 * the beginning and end of an article.
 *
 * The program is designed to process news articles, rejecting text from
 * the header, signature, and included text, together with cruft inserted
 * by rn.
 *
 * After the table is built (and it can be big), articles are generated
 * on the standard output.
 */
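/*
 * A small example (illustration only): after digesting the text
 * "it was a dark and stormy night and a dark wind", the pair
 * ("a", "dark") has been seen twice, followed once by "and" and once
 * by "wind"; when generating, each of those words would follow
 * "a dark" with probability 1/2.
 */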
int in_included_text = 0;
%}
%Start HDR BODY SIG
%%
<HDR>^[^ \t]+:.*\n	;	/* Header line, e.g. "From: foo@bar.UUCP" */
<HDR>^[ \t]+[^ \t].*\n	;	/* Continuation of header line */
<HDR>^[ \t]*$		BEGIN BODY;
<BODY>^[><].*\n		{ if (!in_included_text) {
    			      in_included_text = 1;
			      process_token ("\n> ...\n\n");
			      }
			}
<BODY>^"In article".*\n	;	/* Reject rn crud */
<BODY>^"-- "$		BEGIN SIG;
<BODY>[ \t]+		;	/* Skip white space */
<BODY>\n[ \t\n]*\n	{ process_token ("\n"); /* Paragraph break */}
<BODY>[^ \t\n]+		{ in_included_text = 0; process_token (yytext); }
<HDR>.			;	/* Eat anything that escaped */
<HDR>\n			;
<BODY>\n		;
<SIG>.			;
<SIG>\n			;
%%
void perror(), exit();
char *strcpy(), *malloc();
extern int optind;
extern char *optarg;

/*
 * hashtab is a hash table storing all the tokens we encounter.
 */
struct htentry {
    char *htext;
    struct htentry *hlink;
};

#define HSIZE 3557		/* Should be prime */
#define Fprintf (void)fprintf
#define Printf (void)printf

struct htentry hashtab[HSIZE];

/*
 * node and succnode are portions of the big structure we're going to build.
 * node represents something like ("was", "a") in a binary tree.
 * a linked list of succnodes contain tokens that may follow ("was", "a")
 */
struct node {
    char *text;
    char *text2;
    int ocount;
    struct node *lc, *rc;
    struct succnode *succ;
};

struct succnode {
    struct node *scnod;
    int    count;
    struct succnode *link;
};
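
/*
 * Continuing the illustration above: the node for ("a", "dark") would
 * have an ocount of 2 and a succ chain of two succnodes, one pointing
 * at the node for ("dark", "and") and one at the node for
 * ("dark", "wind"), each with a count of 1.
 */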


struct node *prev_code = NULL;
char *prev_token = NULL, **Argv;
int init_state = HDR;
int verbose = 0;
struct node *root = NULL, *tknptr;
struct succnode *start = NULL;
int n_pairs = 0, n_tokens = 0, n_files = 0, n_total = 0;

struct node *insert_token();
char *savetoken();

process_token (txt)
char *txt;
{
     struct node *code;
     char *token = savetoken (txt);
/* We have a new token.  Say the previous two tokens were "one" "way"
 * and the current token is "to".  Then prev_code points to a node
 * for ("one", "way") and token is "to".  This function adds ("way", "to") as a
 * successor to ("one","way") and makes prev_code point to ("way","to").
 */
     code = insert_token (prev_token, token);
     insert_pair (prev_code, code);
     prev_code = code;
     prev_token = token;
     return;
}

/*
 * here it is.
 */
main (argc, argv)
int argc;
char  **argv;
{
    int     i, c, n_articles = 10;
    char *dumpfile = NULL;
    extern int  optind;
    extern char *optarg;

    while ((c = getopt (argc, argv, "pvn:d:s:")) != EOF) {
	switch (c) {
	    case 'v':
		verbose = 1;
		break;
	    case 'p':		/* Input is plain text, not Usenet stuff */
		init_state = BODY;
		break;
	    case 'n': 		/* # articles to generate */
		n_articles = atoi (optarg);
		break;
	    case 'd':		/* where to dump the data structure */
		dumpfile = optarg;
		break;
	    case 's':		/* Set the seed for rand */
		srand ((unsigned) atoi (optarg));
		break;
	    default:
		Fprintf (stderr,
		    "Usage: markov3 [-pv] [-n n_art] [-d dumpfile] [-s seed] [files]\n");
		exit (1);
	}
    }
    BEGIN init_state;		/* initial state of lexical analyzer */
/* Note: if optind == argc, there are no file arguments.  yyin is left
 * initialized to stdin.
 */
    if (optind < argc) {
/* yyin is lex input stream.  Point to first file. */
	if ((yyin = fopen (argv[optind], "r")) == NULL) {
	    perror (argv[optind]);
	    exit (1);
	}
	optind++;	/* so yywrap opens the next file, not this one again */
    }
    Argv = argv;		/* make it global so yywrap can access it */
    n_files = 1;
/* yylex puts all the input files through the lexical analyzer and builds
 * the database.
 */
    (void) yylex ();
    if (dumpfile)
	dump_database (dumpfile);
    if (verbose)
	Fprintf (stderr,
	 "Total of %d tokens (%d different), %d different pairs, %d files\n",
		n_total, n_tokens, n_pairs, n_files);
/* Generate the articles, separated by form feeds */
    for (i = 0; i < n_articles; i++) {
	if (i > 0) output_word ("\n\f\n");
	generate_article ();
    }
    return 0;
}

/*
 * Lex calls this when EOF is reached.  It opens the next file if there
 * is one.  Lex interprets a return value of 1 to mean "all done" and 0
 * to mean "keep going".
 */
yywrap () {
    (void) fclose (yyin);
    insert_pair (prev_code, (struct node *)0);
    prev_code = NULL;
    prev_token = NULL;		/* the next file starts from a fresh state */
    if (Argv[optind] == NULL) return 1;
    else if ((yyin = fopen (Argv[optind], "r")) == NULL) {
	perror (Argv[optind]);
	exit (1);
    }
    optind++;
    in_included_text = 0;
    if (verbose && n_files % 10 == 0)
	Fprintf (stderr, "%d files\n", n_files);
    n_files++;
    BEGIN init_state;
    return 0;
}

/*
 * This function saves a token in the hash table (if it isn't there
 * already) and returns a pointer to the stored copy.
 */
char *
savetoken (txt)
char *txt;
{
    int h;
    char *p;
    struct htentry *hp;

    n_total++;
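/* Hash the token by summing its character codes; reduced mod HSIZE below. */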
    for (p = txt, h = 0; *p; h += *p++);
    hp = hashtab + (h % HSIZE);
    while (hp->hlink) {
	if (strcmp (hp->htext, txt) == 0) {
	    return hp->htext;
	}
	hp = hp->hlink;
    }
/* OK, it's a new token.  Make hp->hlink point to a new,
 * null block and make hp->htext point to the text.
 */
    hp->hlink = (struct htentry *) malloc (sizeof *hp);
    hp->htext = malloc ((unsigned)(strlen (txt) + 1));
    (void) strcpy (hp->htext, txt);
    hp->hlink->hlink = NULL;
    hp->hlink->htext = NULL;
    n_tokens++;
    return hp->htext;
}

/*
 * This recursive function inserts a token pair into the tree.
 */
struct node *
insert_in_tree (p, txt, txt2)
struct node *p;
char *txt, *txt2;
{
    int cmp;
    if (p == NULL) {
/* Create a new node. */
	p = (struct node *) malloc (sizeof *p);
	p->text = txt;
	p->text2 = txt2;
	p->lc = p->rc = NULL;
	p->succ = NULL;
	p->ocount = 1;
	tknptr = p;
	n_pairs++;
	if (verbose && n_pairs % 1000 == 0)
	    Fprintf (stderr, "%d pairs\n", n_pairs);
	return p;
    }
    cmp = my_strcmp (p->text, txt);
    if (cmp == 0) cmp = my_strcmp (p->text2, txt2);
    if (cmp == 0) {
/* It's a match.  Increment the count. */
        tknptr = p;
	p->ocount += 1;
    }
/* Look in the subtrees. */
    else if (cmp < 0) p->lc = insert_in_tree (p->lc, txt, txt2);
    else p->rc = insert_in_tree (p->rc, txt, txt2);
    return p;
}

/*
 * This just calls insert_in_tree starting at the root
 */
struct node *
insert_token (txt, txt2)
char *txt,*txt2;
{
    root = insert_in_tree (root, txt, txt2);
    return tknptr;
}

/*
 * This function adds a successor.
 */
struct succnode *
insert_in_succ_chain (sp, np)
struct succnode *sp;
struct node *np;
{
    if (sp == NULL) {
	sp = (struct succnode *) malloc (sizeof *sp);
	sp->scnod = np;
	sp->count = 1;
	sp->link = NULL;
    }
    else if (sp->scnod == np)
	sp->count += 1;
    else sp->link = insert_in_succ_chain (sp->link, np);
    return sp;
}

/*
 * This calls insert_in_succ_chain starting at the right place.
 */
insert_pair (p1, p2)
struct node *p1, *p2;
{
    if (p1) p1->succ = insert_in_succ_chain (p1->succ, p2);
    else start = insert_in_succ_chain (start, p2);
}

/*
 * This function dumps the stored data structure onto a file.
 * Now if only I had a function to read it back in.
 */
char *
pr_token (txt)
char *txt;
{
    if (txt == NULL)		/* the beginning-of-article state has no text */
	return "<BOF>";
    if (txt[0] != '\n')
	return txt;
    return txt[1] ? "<INCL>" : "<LF>";
}

treedump (tree, fp)
struct node *tree;
FILE *fp;
{
    if (tree) {
	treedump (tree->rc, fp);
	Fprintf (fp, "( %s %s ) %d", pr_token (tree->text),
			pr_token (tree->text2), tree->ocount);
	chaindump (tree->succ, fp);
	treedump (tree->lc, fp);
    }
}

/*
 * Subroutine of treedump; it does one row.
 */
chaindump (p, fp)
struct succnode *p;
FILE *fp;
{
    char   *text;
    while (p) {
	if (p->scnod == NULL)
	    text = "<EOF>";
	else text = pr_token (p->scnod->text2);
	Fprintf (fp, " %s %d", text, p->count);
	p = p->link;
    }
    putc ('\n', fp);
}

/*
 * This routine generates the dump file (-d option)
 */
dump_database (file)
char *file;
{
    FILE *fp = fopen (file, "w");
    if (fp == NULL) {
	Fprintf (stderr, "markov: can't open ");
	perror (file);
	exit (1);
    }
    Fprintf (fp, "START:");
    chaindump (start, fp);
    treedump (root, fp);
}

/* roll (n) generates a uniformly distributed rv between 0 and n-1.
 * This may need to be changed for your system.
 */
roll (n) {
    return (int) (n * (double) rand() / 32768.);
}
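
/*
 * The division by 32768. assumes rand() returns values in the range
 * 0 to 32767.  On systems where rand() returns much larger values
 * (4.2/4.3BSD, for instance) roll() as written can return numbers far
 * outside 0 to n-1.  An untested sketch of a replacement for such
 * systems:
 *
 *	roll (n) {
 *	    return rand () % n;
 *	}
 */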

/*
 * Here it is!  This function generates an article by traversing the
 * structure we've built.
 */
generate_article () {
    struct succnode *p = start;
    int ncounts = n_files;
    int n, accum;
    char *tp;

    while (1) {
/* Roll the dice to find out the next token.  The code below selects the
 * next token, and the new state, with a probability corresponding to the
 * frequency in the input.
 */
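/* For example, if this state was seen 4 times in the input and its two
 * successors have counts 3 and 1, then roll(4) yields 0-3; values 0-2
 * select the first successor (probability 3/4) and 3 selects the
 * second (probability 1/4).
 */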
	n = roll (ncounts);
	accum = p->count;
	while (accum <= n && p->link) {
	    p = p->link;
	    accum += p->count;
	}
/* Check for "end of story": the end-of-article marker has a null scnod */
	if (p->scnod == NULL)
	    break;
	tp = p->scnod->text2;
	output_word (tp);
	ncounts = p->scnod->ocount;
	p = p->scnod->succ;
    }
    output_word ("\n");		/* flush any partly filled last line */
    return;
}

/*
 * A version of strcmp that accepts NULL pointers; NULL sorts before
 * any string.
 */
my_strcmp (a, b)
register char *a, *b;
{
    if (a == NULL) return b ? -1 : 0;
    if (b == NULL) return 1;
    return strcmp (a, b);
}

#define LEN 75
output_word (word)
char *word;
{
    static char line[LEN+1];
    static int room = LEN;
    int l = strlen (word);
/* If word won't fit, or starts with \n, dump the current line */
    if ((l >= room || word[0] == '\n') && line[0]) {
	Printf ("%s\n", line);
	line[0] = 0;
	room = LEN;
    }
/* If word won't fit in the buffer or starts with \n, print it now */
    if (l >= LEN)
	Printf ("%s\n", word);
    else if (word[0] == '\n')
	Printf ("%s", word);
/* Otherwise fill it in */
    else {
	(void)strcat (line, word);
	(void)strcat (line, " ");
	room -= (l + 1);
    }
    return;
}
SHAR_EOF
fi # end of overwriting check
if test -f 'markov3.6'
then
	echo shar: will not over-write existing file "'markov3.6'"
else
cat << \SHAR_EOF > 'markov3.6'
.\" markov3
.TH MARKOV3 6 "17 Feb 1987"
.UC 4
.SH NAME
markov3 \- Digest and spit out quasi-random Usenet articles
.SH SYNOPSIS
.B markov3
[
.B \-pv
] [
.B \-n
.I n_articles
] [
.B \-d
.I dumpfile
] [
.B \-s
.I seed
] [
.I files
]
.SH DESCRIPTION
.PP
.I Markov3
digests Usenet articles and builds an internal data structure that
models the articles as if they came from a random process, where
each word is determined by the previous two.  It then emits a series
of articles on the standard output that have the same distribution
of words, word pairs, and word triplets as do the input files.
The name
.I markov3
comes from the fact that this structure is called a Markov chain,
and that the statistics for word triplets are modeled.
Here, a "word" is a sequence of printable characters surrounded by
whitespace.  Paragraph breaks (blank lines) are also treated as a
"word".
.PP
By default, the program expects to be fed Usenet articles; it strips
off headers, included text, and signatures (or at least it tries).
The
.B \-p
(plain) option disables the header-stripping feature (otherwise
everything is skipped until a blank line is encountered).
.PP
By default, 10 articles, separated by form feeds, are written on the
standard output.  The
.B \-n
option lets you specify a different number.
.PP
The
.B \-d
(dump) option dumps a representation of the internal data structure
built by
.I markov3
on the named file.
.PP
Finally, the
.B \-v
(verbose)
option prints some statistics on the standard error.
.SH "CAVEATS"
This program allocates lots of memory if given large amounts of input.
On virtual memory systems, the paging behavior is atrocious because
pointers tend to point every which way, and many pointers are dereferenced
for every word processed.  This could be improved, I'm sure.
.PP
Posting articles generated by
.I markov3
to the net may be hazardous to your health.
.PP
Not as smart as Mark V. Shaney.
.SH "PORTABILITY"
An effort has been made to make this program as portable as possible;
however, you need lex(1).  It uses the rand(3) function for random
number generation; look here if the output seems too predictable.
The function roll(n), which generates a random integer between 0 and
n-1, assumes rand(3) returns values between 0 and 32767; if your
rand(3) has a larger range, you'll have to redo roll(n).  I'm not
certain how portable rand(3) is.
.PP
If you don't have lex, you'll need to rewrite the lexical analyzer,
but most of the program is plain C.
SHAR_EOF
fi # end of overwriting check
if test -f 'getopt.c'
then
	echo shar: will not over-write existing file "'getopt.c'"
else
cat << \SHAR_EOF > 'getopt.c'
/*
 * getopt - get option letter from argv
 * by Henry Spencer
 * posted to Usenet net.sources list
 */

#include <stdio.h>

char	*optarg;	/* Global argument pointer. */
int	optind = 0;	/* Global argv index. */

static char	*scan = NULL;	/* Private scan pointer. */

extern char	*index();

int
getopt(argc, argv, optstring)
int argc;
char *argv[];
char *optstring;
{
	register char c;
	register char *place;

	optarg = NULL;

	if (scan == NULL || *scan == '\0') {
		if (optind == 0)
			optind++;
	
		if (optind >= argc || argv[optind][0] != '-' || argv[optind][1] == '\0')
			return(EOF);
		if (strcmp(argv[optind], "--")==0) {
			optind++;
			return(EOF);
		}
	
		scan = argv[optind]+1;
		optind++;
	}

	c = *scan++;
	place = index(optstring, c);

	if (place == NULL || c == ':') {
		fprintf(stderr, "%s: unknown option -%c\n", argv[0], c);
		return('?');
	}

	place++;
	if (*place == ':') {
		if (*scan != '\0') {
			optarg = scan;
			scan = NULL;
		} else {
			optarg = argv[optind];
			optind++;
		}
	}

	return(c);
}
SHAR_EOF
fi # end of overwriting check
#	End of shell archive
exit 0
-- 
- Joe Buck 	{hplabs,ihnp4,sun,ames}!oliveb!epimass!jbuck
  Entropic Processing, Inc., Cupertino, California

cracraft@sonia.UUCP (02/18/87)

In article <906@epimass.UUCP> jbuck@epimass.UUCP (Joe Buck) writes:
>I created a bit of a stir with this program in December 1986 when I
>used an earlier version of it to simulate a certain well-known net
>personality (Hi Laura!).  It digests Usenet articles and spits out
>other articles with similar characteristics.  You need lex to run it,
>but otherwise it should run on any Unix I know of.  
>- Joe Buck 	{hplabs,ihnp4,sun,ames}!oliveb!epimass!jbuck
>  Entropic Processing, Inc., Cupertino, California

Well, I compiled and ran it, but at the end of producing its amusing
output, it consistently gives "Segmentation fault - core dumped".  This
is on a VAX running 4.3BSD.

Stuart

falk@peregrine.UUCP (02/19/87)

>In article <906@epimass.UUCP> jbuck@epimass.UUCP (Joe Buck) writes:
>>I created a bit of a stir with this program in December 1986 when I
>>used an earlier version of it to simulate a certain well-known net
>>personality (Hi Laura!).  It digests Usenet articles and spits out
>>other articles with similar characteristics.
>
>Well, I compiled and ran it, but at the end of producing its amusing
>output, it consistently gives "Segmentation fault - core dumped".  This
>is on a VAX running 4.3BSD.
>
>Stuart

This happened to me on a Sun.  I tracked it down to these few lines:

		  tp = p->scnod->text2;
	  /* Check for "end of story" */
		  if (tp == NULL)
		      break;

Sometimes "p->scnod" is null, so you get a seg fault.  I added, just
before the assignment to tp, the check:

		if (p->scnod == NULL)
		    break;

That prevented the core dumps, but the program always gives me the
exact same output.  I think the random number generator is being
initialized to the same thing on each run.


		-ed falk, sun microsystems
		 sun!falk, falk@sun.com
terrorist, cryptography, DES, drugs, cipher, secret, decode, NSA, CIA, NRO.

slindahl@udenva.UUCP (02/20/87)

In article <4521@curly.ucla-cs.UCLA.EDU> cracraft@sonia.UUCP (Stuart M. Cracraft) writes:
>In article <906@epimass.UUCP> jbuck@epimass.UUCP (Joe Buck) writes:
>>I created a bit of a stir with this program in December 1986 when I
>>used an earlier version of it to simulate a certain well-known net
>>personality (Hi Laura!).  It digests Usenet articles and spits out
>>other articles with similar characteristics.  You need lex to run it,
>>but otherwise it should run on any Unix I know of.  
>
>Well, I compiled and ran it, but at the end of producing its amusing
>output, it consistently gives "Segmentation fault - core dumped".  This
>is on a VAX running 4.3BSD.
>
>Stuart

Well, I compiled it on 4.2BSD and I get the same error.  Any suggestions?


-- 
][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][
Steve J. Lindahl     |UUCP      {hplabs,seismo}...!hao!udenva!duorion!slindahl
University of Denver |BITNET    slindahl@ducair
------------------------------------------------------------------------------
Major of the week::DECISION SCIENCES |        Survive, Adapt, Overcome
][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][