[net.cog-eng] Utterances

douglas@bcsaic.UUCP (douglas schuler) (08/12/86)

I have a very simple question that I'd like an answer to.

What percentage of the utterances made in the world have been made before?

A similar question would be "how many sentences in today's New York Times
are duplicates?"

I've had three answers so far: 40%, less than 25%, and 7.81231%.

I can think of no immediate use for this information.
-- 

   ** MY VIEWS MAY NOT BE IDENTICAL TO THOSE OF THE BOEING COMPANY **

	Doug Schuler     (206) 865-3228
	{allegra,ihnp4,decvax}uw-beaver!uw-june!bcsaic!douglas
	bcsaic!douglas@uw-june.arpa

gilbert@aimmi.UUCP (Gilbert Cockton) (08/20/86)

In article <639@bcsaic.UUCP> douglas@bcsaic.UUCP writes:
>
>What percentage of the utterances made in the world have been made before?
>
Depends where you are and how clockwork the people are. Put yourself
amongst a group of high conformers and you could be up to 80%. I find
some stereotype-fitting groups (e.g. Sloans, provincial business men,
Militants, Born-again Christians, worn out Trade Unionists,
politicians, usenet policemen, etc. etc. ad nauseam) boringly predictable.

Repeated utterances are linguistic reflections of stable shared beliefs.
Sociologists and cultural anthropologists have been modelling this for
years (it's called ideology), which makes the current frantic search
(blind) for knowledge acquisition techniques a little ironic.

Knowledge engineers! Pah, read Malinowski, then if you can handle big
words, try some recent sociological theory on ideology.

wex@milano.UUCP (08/22/86)

In article <792@aimmi.UUCP>, gilbert@aimmi.UUCP (Gilbert Cockton) writes:
> In article <639@bcsaic.UUCP> douglas@bcsaic.UUCP writes:
> >What percentage of the utterances made in the world have been made before?
>
> Depends where you are and how clockwork the people are. Put yourself
> amongst a group of high conformers and you could be up to 80%. I find
> some stereotype-fitting groups (e.g. Sloans, provincial business men,
> Militants, Born-again Christians, worn out Trade Unionists,
> politicians, usenet policemen, etc. etc. ad nauseam) boringly predictable.

I think that Mr. Cockton misinterprets the original posting, but does so in
an interesting way.  It seemed to me that the original request had to do
with linguistic repetitions, where a sentence is repeated word-for-word.
(Aside: do translations count?)
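
Under that word-for-word reading, the count itself is easy to sketch.
The following is only an illustration, assuming that two sentences are
"the same" when they match after lower-casing and stripping punctuation;
the function name and the sample sentences are invented for the example.

```python
import re

def repeated_fraction(sentences):
    """Fraction of sentences that repeat an earlier one word-for-word.

    Assumption: sentences are compared after lower-casing and dropping
    punctuation, so "Nice day." and "nice day" count as the same.
    """
    seen = set()
    repeats = 0
    for s in sentences:
        key = " ".join(re.findall(r"[a-z0-9']+", s.lower()))
        if key in seen:
            repeats += 1
        else:
            seen.add(key)
    return repeats / len(sentences) if sentences else 0.0

# Invented sample: the third sentence repeats the first.
sample = [
    "Nice weather today.",
    "Have you read the paper?",
    "nice weather today",
    "Not yet.",
]
print(repeated_fraction(sample))  # 0.25
```

The answer depends entirely on what the corpus is and where a "sentence"
starts and stops, which is the hard part of the question.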

What Cockton has done is to cast the question in the semantic domain by
remarking on some people's lack of original thought.  It is not important
what words they use; the ideas are repeated.  That leads to the interesting
question: "how much thought is original?"  If the answer is low, then ought
we emphasize `research' that's just sophisticated searching of previously-
recorded knowledge?

-- 
Alan Wexelblat
ARPA: WEX@MCC.ARPA or WEX@MCC.COM
UUCP: {ihnp4, seismo, harvard, gatech, pyramid}!ut-sally!im4u!milano!wex

"It is quite impossible for any design to be `the logical outcome of the
requirements' simply because, the requirements being in conflict, their
logical outcome is an impossibility."

begeman@milano.UUCP (08/22/86)

In article <639@bcsaic.UUCP>, douglas@bcsaic.UUCP (douglas schuler) writes:
> I have a very simple question that I'd like an answer to.
> 
> What percentage of the utterances made in the world have been made before?
> 
An interesting, though slightly ambiguous question.  What follows is a
report of a statistical technique used to answer similar questions such
as:
	How many words did Shakespeare know, but not use in his works?

	How many species of butterfly exist on this island that
	I did not capture?

The January 24, 1986 issue of Science magazine (p. 335) has an article
titled "Shakespeare's New Poem:  An Ode to Statistics" which spells this
out:  In the 1940s, a biologist was collecting butterflies in Malaysia
and noticed that he caught members of some species dozens of times,
some species several times, and others just once.  He told a statistician
by the name of Sir Ronald Fisher what species he had seen (and how many
times he had seen each), and then asked the question: how many species are
there that he did *not* see?

Assuming that the butterflies were randomly captured in proportion to
how many of each species there are, the question is, surprisingly,
answerable.
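
To give a feel for why it is answerable, here is a small sketch.  It is
not Fisher's own calculation (he fitted a parametric model); it uses the
simpler alternating-series estimate usually credited to Good and Toulmin,
which needs only n_k, the number of species caught exactly k times.  The
function names and the toy capture list are invented for illustration.

```python
from collections import Counter

def frequency_of_frequencies(captures):
    """Return n, where n[k] = number of species caught exactly k times."""
    per_species = Counter(captures)
    return Counter(per_species.values())

def expected_new_species(n, t=1.0):
    """Estimate how many new species would turn up in further trapping
    lasting t times the original effort:

        sum over k of (-1)^(k+1) * t^k * n[k]

    Assumption: each species is caught at its own constant random rate.
    The series is only trustworthy for t <= 1.
    """
    return sum((-1) ** (k + 1) * t ** k * count for k, count in n.items())

# Toy data: one species caught 5 times, one 3 times, two twice, four once.
captures = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"] * 2 + ["e", "f", "g", "h"]
n = frequency_of_frequencies(captures)
print(expected_new_species(n, t=1.0))  # 4 - 2 + 1 + 1 = 4.0
```

The point is that the species seen only once or twice carry most of the
information about the species not seen at all.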

The same technique was recently applied to determine the authenticity
of an alleged Shakespearean poem discovered in Oxford in 1985.
Fortunately for the researchers, all of Shakespeare's works have been
put into machine readable form.  Using a modified "butterfly" technique,
the researchers predicted how many words in a new work of a certain
length would never have been used before, would have been used just once
before, twice before, and so on.  (The new poem, by the way, fell into the
Shakespearean pattern beautifully - it is now considered to be an
authentic "find".)
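
A rough sketch of that kind of prediction follows.  It assumes words turn
up like randomly captured species (a Poisson-style model) and is not
claimed to be the researchers' exact procedure; the function names and
the toy corpus are invented.  With n[k] the number of distinct words used
exactly k times in the existing works, and t the length of the new text
as a fraction of the corpus, the expected number of distinct words in the
new text that were used exactly x times before is taken to be the sum
over k >= 1 of (-1)^(k+1) * C(x+k, k) * t^k * n[x+k].

```python
from collections import Counter
from math import comb

def predicted_counts(corpus_words, new_length, max_x=3):
    """For x = 0..max_x, predict how many distinct words in a new text of
    new_length words occurred exactly x times in the corpus."""
    n = Counter(Counter(corpus_words).values())  # n[k]: words used k times
    t = new_length / len(corpus_words)
    max_k = max(n)
    return {
        x: sum((-1) ** (k + 1) * comb(x + k, k) * t ** k * n.get(x + k, 0)
               for k in range(1, max_k - x + 1))
        for x in range(max_x + 1)
    }

def observed_counts(corpus_words, new_words, max_x=3):
    """Count distinct words of the new text by prior usage in the corpus."""
    prior = Counter(corpus_words)
    tally = Counter(prior[w] for w in set(new_words))
    return {x: tally.get(x, 0) for x in range(max_x + 1)}

# Toy corpus, far too small for real use; it only shows the bookkeeping.
corpus = "the cat sat on the mat and the dog sat on the log".split()
new = "the bird sat on a log".split()
print(predicted_counts(corpus, len(new)))
print(observed_counts(corpus, new))
```

A disputed text would be judged by how closely the observed counts track
the predicted ones, compared with texts known to be by other authors.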

For more details on the technique, please check out the original
article in Science.  There was a followup letter and clarification
in the March 21, 1986 issue.

Now as for how many utterances in the world have been made before,
well, if we start counting....

References:

      R.A. Fisher, A.S. Corbet, C.B. Williams, "The relation between
      the number of species and the number of individuals in a random
      sample of an animal population," J. Anim. Ecol. 12, 42(1943).

      B. Efron and R. Thisted, "Estimating the number of unseen
      species:  How many words did Shakespeare know?" Biometrika 63,
      435(1976).

      "Shakespeare's New Poem:  An Ode to Statistics," Science v.231
      335(1986).

--
	Of all the things I've lost, I miss my mind the most.

Michael L. Begeman              Microelectronics and Computer Technology Corp
Software Technology Program     Austin (where the sun always shines) Texas

uucp:	{ihnp4, seismo, harvard, gatech, pyramid}!ut-sally!im4u!milano!begeman
arpa:	begeman@mcc.ARPA

peterson@milano.UUCP (08/26/86)

In article <639@bcsaic.UUCP>, douglas@bcsaic.UUCP (douglas schuler) writes:
> What percentage of the utterances made in the world have been made before?

Since no one else has asked: what is an utterance? Verbal or written?
I can see the following possibilities:

1. word
2. phrase
3. sentence

It seems unlikely that "1. word" is of interest -- this is effectively
the rate of production of new words, which (I assume) is very very
low.

It seems unlikely that "3. sentence" is of interest -- many verbal
utterances are incomplete sentences.

Hence "2. phrase" seems most likely.  Now how do we define a phrase?
Semantically, or simply as a sequence of words? What about subphrases
and superphrases?
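
If a phrase is taken to be simply a sequence of n consecutive words, the
repetition rate can at least be computed, and the subphrase problem shows
up as a dependence on n.  A minimal sketch, with an invented function
name and an invented example sentence:

```python
def ngram_repeat_rate(words, n):
    """Fraction of n-word sequences in `words` that already occurred
    earlier in the same stream."""
    seen = set()
    repeats = total = 0
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        total += 1
        if gram in seen:
            repeats += 1
        else:
            seen.add(gram)
    return repeats / total if total else 0.0

words = "how are you today and how are you now".split()
for n in (1, 2, 3):
    print(n, ngram_repeat_rate(words, n))
```

Here 3 of the 9 single words are repeats, 2 of the 8 word pairs, and 1
of the 7 triples: the longer the phrase, the lower the rate, so any
percentage quoted has to say which n it means.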
-- 
James Peterson
peterson@mcc.com  or  ...sally!im4u!milano!peterson

gilbert@aimmi.UUCP (Gilbert Cockton) (09/01/86)

In article <2133@milano.UUCP> wex@milano.UUCP writes:

>I think that Mr. Cockton misinterprets the original posting, but does so in
>an interesting way.  It seemed to me that the original request had to do
>with linguistic repetitions, where a sentence is repeated word-for-word.
>

Glad to have my ideas regarded as interesting, thank you.  However, I think
I've widened rather than misinterpreted the question in the original posting,
as I was thinking about repeated utterances as * linguistic reflections *
of shared beliefs, and not just the beliefs/ideas themselves. I was being
logocentric.

>What Cockton has done is to cast the question in the semantic domain by
>remarking on some people's lack of original thought.  It is not important
> what words they use; the ideas are repeated.  

Granted, but the same words are often used by stereotype-fitting groups 
(e.g. Sloans, provincial business men, Militants, Born-again Christians, 
worn out Trade Unionists, politicians, usenet policemen, etc. etc. ad nauseam,
PLUS some groups I belong to).  Here are some made up examples (no time to do
proper research - excuse empirical laziness) for each group:

	"Don't be such an oik Jeremy"
	"This government does nothing for small businesses"
	"The Lord will provide"
	"Workers, fight capitalist oppression by a police state!"
	"We are in a confrontation situation"
	"I cannot answer hypothetical questions"
	"FLAME ON!!!!! Keep this junk out of net.general"
	"Let's not get bogged down in methodology"
	"Can you see anything from the top?"
	"Divn't ye work yer ticket wi' me!"

>  .........................................  That leads to the interesting
> question: "how much thought is original?" 

And also to the question "How many utterances have any conscious thought
behind them?" (Marxist answer: Only ours, false consciousness doesn't count!).
Many utterances have a ritual significance that says more about the speaker's
role and place in a group than it does about the speaker's active model of
the manipulable world. Thus I expect ritual utterances (church services,
police arrest speeches, annual reports, social niceties) to be the most
repeated utterances, after advertising cliches!

I've seen a few statistical contributions to the discussion so far. Given
an adequate corpus of utterances for the experimental situation (which
you'll never get as you don't know what an adequate corpus is until you
know the significant variables, and you won't know what these are until
you've studied a representative population) you might get a reliable
NUMBER (wow). However, as with all statistics, you'll be no wiser about WHY
certain utterances are so common in certain contexts.

Death to all naive empiricism. I want explanations, not descriptions. As for
data ...................