[net.crypt] Subs, the statistical reconstructor

kay@flame.UUCP (Kay Dekker) (12/02/84)

Lots & lots of people have written to me asking for copies of subs.
Several have suggested posting it to net.sources.  Though I like
getting mail from you all, it would probably be nicer for the people
we send mail through if I did post it.

I will still send copies by mail for people who mailed me for it,
because they may not get the sources newsgroup, but if you do, please
get your copy from there.  You could still mail me & say 'Hi!' anyway :-)

						{Enjoy, }+
							Kay.
-- 
"But you TOLD me to type 'rm * .o', and I DID, and it said 'rm: .o nonexistent', and ...
			... mcvax!ukc!flame!ubu!kay

steven@mcvax.UUCP (Steven Pemberton) (12/03/84)

It was interesting to see this program, not least because I've had a version
running here for a couple of years written in the new language B (no
relation to C's predecessor). The nice thing about the B version is that it
is only 24 lines long, compared with the 198 lines of C. Consequently, I
thought it might be worth posting it here as it is so short. An explanation
of the relevant parts of B follows the program. Comments begin with the
character "\"

HOW'TO RECONSTRUCT document USING n GRAMS:
    PUT " "^^(n-1), {} IN gram, followers
    FOR line IN document: ANALYSE  \ Analyse each line in turn
    FOR count IN {1..10}: GENERATE \ 10 lines of imitation
ANALYSE: \ Analyse one line
    FOR char IN line:
        UPDATE followers FOR gram WITH char
        APPEND char TO gram
    UPDATE followers FOR gram WITH " " \ Treat line end as space
    APPEND " " TO gram
GENERATE: \ Generate one line
    PUT 0 IN length
    CHOOSE gram FROM keys followers \ Choose a random start
    WHILE gram in keys followers AND (length<50 OR char<>" "):
        CHOOSE char FROM followers[gram]
        WRITE char
        PUT length+1 IN length
        APPEND char TO gram
    WRITE / \ Write a newline

HOW'TO UPDATE followers FOR gram WITH char:
    IF gram not'in keys followers: PUT {} IN followers[gram]
    INSERT char IN followers[gram]

HOW'TO APPEND char TO gram:
    PUT (gram^char)@2 IN gram

Sample output, using the poem "Mary had a little lamb" as input, and using
i-grams for i=1..4:
1: nhwaso ldm r ttvncd vetsvaen ryam   wnMhuaaiaa lryulitlab
2: evere Mad ad ts go Mamb snts s s ad lamb flits it
3: ry hat Marywhery was was fleece that lamb ittleece
4: te as snow and everywhere that lamb was white as white

Commentary.
B is a very simple language, about as easy to learn as BASIC, but the big
difference is that B has very powerful data-types, and it's thanks to these
that programming in B is so easy.  It has two basic types, numbers and texts,
and three constructed types, lists, tables, and compounds.

Texts are strings of characters.  For instance "hello" is a text.  There is
no character data-type: you just use a text of length one, such as "a" or "b".
There are operators on texts: a^b joins the two texts a and b, a^^n repeats
the text a n times, and a@n gives the tail of text a starting at the nth
character.

Lists are sorted lists of elements.  E.g., {"B"; "Pascal"; "Smalltalk"} is a
list of texts, as is {"a"; "b"; "c"}, and {"z"}.  {} is the empty list.  You
can insert a new element e in a list l with INSERT e IN l.  Although lists
are kept sorted (alphabetically in the case of texts) the program doesn't
use this fact.

The program deals with n-grams, that is, groups of n letters from a text.
The B program works by associating with each group of n-1 letters, a list of
letters that may follow it.  Thus when dealing with trigrams for the
sentence "Nonsense imitation can be disconcerting", for the letters "on" you
get the list {" "; "c"; "s"}, and for "ns" you get {"e"; "e"}.  This
association is done using the table data-type, which is a generalisation of
arrays: in most other languages, you may only index arrays with integers (or
similar), while in B you can use any type.  In this program a table called
'followers' is used, indexed by texts and giving lists of characters, so that
with the above sentence you get followers["ns"] = {"e"; "e"} for instance.
You can find out which indexes have been used for a table: 'keys followers'
gives the list of such indexes.  {} is also the empty table.

Another interesting point is that little of the program needs to be altered
for it to work with words instead of characters. UPDATE for instance remains
identical.

B is an interactive language, and there are implementations of B for
machines with Unix, and shortly for the IBM PC without Unix.  Anyone who
would like to know more about B or its implementations should mail me.

Steven Pemberton, CWI, Amsterdam; steven@mcvax.