[comp.ai] analysis of unknown data

dave@mimsy.UUCP (03/05/87)

    What systematic methods and techniques would you apply to the
    following problem?

    Determine the representation, organization, and content of a
    "file" containing up to 156MB.  There are no assumptions.  The
methods and techniques applied must be automated (if not fully
automatic) and applicable to an unlimited supply of "files".

marty1@houdi.UUCP (M.BRILLIANT) (03/06/87)

In article <5681@mimsy.UUCP>, dave@mimsy.UUCP (Dave Stoffel) writes:
>     What systematic methods and techniques would you apply to the
>     following problem?
> 
>     Determine the representation, organization, and content of a
>     "file" containing up to 156MB.  There are no assumptions....

I thought that would be impossible.  Theoretically, I would think that
if there are no assumptions there can be no reasoning.  In fact, there
are always tacit assumptions that even the author isn't aware of.  If
I'm wrong, please post to the net, as I imagine there are others as
naive as I am.

M. B. Brilliant					Marty
AT&T-BL HO 3D-520	(201)-949-1858
Holmdel, NJ 07733	ihnp4!houdi!marty1

marty1@houdi.UUCP (M.BRILLIANT) (03/06/87)

In article <5681@mimsy.UUCP>, dave@mimsy.UUCP (Dave Stoffel) writes:
> 
>     What systematic methods and techniques would you apply to the
>     following problem?
> 
>     Determine the representation, organization, and content of a
>     "file" containing up to 156MB.  There are no assumptions.

What systematic or unsystematic methods and techniques would you apply
to the following (seemingly easier) problem?

Determine the machine language of a computer with a 64K address space,
8K of RAM, and 48K of ROM containing an operating system and BASIC.
There is user documentation but no system documentation.  The operating
system has undocumented capabilities for writing in RAM, reading any
byte, and starting execution at any byte in the address space.  Ignore
secondary storage.
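[Editor's note: Marty's undocumented read-any-byte capability suggests an obvious first automated step: dump the ROM and histogram the bytes, since the most frequent values in real ROMs tend to include common opcodes and filler. The sketch below is entirely hypothetical; `peek`, the address layout, and the demo memory are illustrative assumptions, not part of the machine described.]

```python
from collections import Counter

# Assumed memory layout: 8K RAM low, 48K ROM above (one of many
# possibilities for the 64K address space described).
ROM_BASE = 0x4000
ROM_SIZE = 48 * 1024

def dump_rom(peek):
    """Read the whole ROM through a byte-read primitive peek(addr)."""
    return bytes(peek(addr) for addr in range(ROM_BASE, ROM_BASE + ROM_SIZE))

def top_bytes(image: bytes, n: int = 5):
    """Most frequent byte values; in real ROMs these often include
    common opcodes (returns, loads) and filler bytes like 0x00/0xFF."""
    return Counter(image).most_common(n)

# Demo against a fake memory; real use would wire peek to the machine.
fake_mem = {a: (0xC9 if a % 7 == 0 else a & 0xFF) for a in range(0x10000)}
image = dump_rom(fake_mem.__getitem__)
print(top_bytes(image))  # 0xC9 dominates in this fabricated image
```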

M. B. Brilliant					Marty
AT&T-BL HO 3D-520	(201)-949-1858
Holmdel, NJ 07733	ihnp4!houdi!marty1

franka@mntgfx.UUCP (03/11/87)

In article <5681@mimsy.UUCP> dave@mimsy.UUCP (Dave Stoffel) writes:
>
>
>    What systematic methods and techniques would you apply to the
>    following problem?
>
>    Determine the representation, organization, and content of a
>    "file" containing up to 156MB.  There are no assumptions.  The
>methods and techniques applied must be automated (if not fully
>automatic) and applicable to an unlimited supply of "files".

Actually, there are several ways to approach this problem.  It is a statement
of finding out what is happening inside a classical "black box".  You can
start by monitoring all requests and replies from the file, searching for
patterns based on location of access and length of access.  You can examine
the bytes returning from the device to try to detect patterns.  You can use
a traffic analysis approach by finding out what types of programs access this
file at which times for a given purpose.  You can go ask the NSA, CIA, and
other intelligence agencies what they do when they try to crack a black box
(though I doubt that they'd tell you :-).  Finally, most boxes are not
completely black.  In general, you can tell information by the location, size,
etc. of a box.  But unless the box is completely isolated (in which case, why
are you all that interested in what it does?) you can always get some
information, upon which you can make your own assumptions, try experiments,
and finally uncover the nature of the object.  You might also consult any good
text on experimental methods to point you in the right direction.
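[Editor's note: Frank's suggestion to examine the returned bytes for patterns can be made concrete with a simple entropy measure. The sketch below is an editorial illustration, not anything posted in the thread: near-maximal entropy hints at compressed or encrypted content, while text sits much lower.]

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte stream, in bits per byte (0.0-8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

# Plain English text clusters well below the 8 bits/byte maximum;
# compressed or encrypted data sits very close to it.
text = b"the quick brown fox jumps over the lazy dog " * 100
print(byte_entropy(text))                    # well below 8 for English text
print(byte_entropy(bytes(range(256)) * 4))   # exactly 8.0: uniform bytes
```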

Frank Adrian
Mentor Graphics, Inc.

dave@mimsy.UUCP (03/13/87)

In article <564@franka.mntgfx.MENTOR.COM>, franka@mntgfx.MENTOR.COM (Frank A. Adrian) writes:
> Actually, there are several ways to approach this problem.  It is a statement
> of finding out what is happening inside a classical "black box".  You can
> start by monitoring all requests and replies from the file, searching for
> patterns based on location of access and length of access.  You can examine
> the bytes returning from the device to try to detect patterns.  You can use
> a traffic analysis approach by finding out what types of programs access this
> file at which times for a given purpose.  
    The "file" of 156MB is not exactly a black box.  The traditional
black box problem describes functions whose structure is not known.
The "file" is data, not procedure.  An unknown number of procedures
may have participated in creating the data.  The "file" is
sitting on my machine after being read off a tape which an
archaeologist dug up.  What is the data?  Maybe it is one logical
file, maybe hundreds.  If hundreds, maybe each one is a different
type.  Maybe the bytes on the tape are not ordered as logical files,
but as physical blocks from some disk pack.  Put it back together
so you can tell the archaeologist what information is on the
tape, so he learns something about the civilization that left it.
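[Editor's note: Dave's "maybe hundreds of logical files, maybe raw disk blocks" framing invites a first automated pass: chop the image into fixed-size blocks and tag each one with crude heuristics, so runs of like-typed blocks can be regrouped into candidate files. A minimal sketch, assuming a 512-byte block size and arbitrary thresholds of the editor's choosing:]

```python
import math
from collections import Counter

BLOCK = 512  # assumed physical block size; real tapes may differ

def entropy(data: bytes) -> float:
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def classify(block: bytes) -> str:
    """Crude per-block tag: text, compressed/encrypted, or other binary."""
    if not block:
        return "empty"
    printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in block)
    if printable / len(block) > 0.95:
        return "text"
    if entropy(block) > 7.5:
        return "compressed/encrypted"
    return "binary"

def block_map(image: bytes) -> list:
    return [classify(image[i:i + BLOCK]) for i in range(0, len(image), BLOCK)]

print(block_map(b"plain ascii text " * 64))  # → ['text', 'text', 'text']
```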

       Dave Stoffel (703) 790-5357
       seismo!mimsy!dave
       dave@Mimsy.umd.edu
       Amber Research Group, Inc.

ben@hpldolm.UUCP (03/18/87)

I have two comments on this discussion; the first is general, the second
specific.

My first comment on this whole discussion, as I understand it, is that
it is silly.  We are being asked to find "the" meaning of some large
file without any context for the file.  Is it text?  Is it integer
data?  Is it floating point data?  Is it encrypted in any way? The 
search for meaning in the absence of context is a waste of time. 
(In essence, I agree with M. B. Brilliant as follows.)

What is meaningful in one context is often not meaningful in another.
However, sometimes, it is.  A file full of integer measurement data will
usually be indistinguishable from a file of a bit-mapped color image.
A bunch of integers is a bunch of integers (unless some *recognizable*
context information is included).  If you take a group of integers and
make a pretty picture with them, what will you do when I tell you that
they were process measurements from a ball-bearing factory?  What will
you do when you interpret a Mandelbrot image as a bad lot of wafers
in an otherwise well controlled fab?

I'm sure that you would like to say that you can't make a pretty
picture with ball bearing data.  Perhaps not in every case, but I know
of a gentleman who *sells* "art" generated from HP stock performance
data.  He has given some stock data meaning in a new context.

The best response to this question was the one from Mr. Adrian,
who suggested that you look for the context(s) that the file
was used in.  If you can't find the correct context, you cannot
ascertain the correct meaning.  If the data exists in a vacuum, you can
choose whatever context that you wish and with enough massaging you
can make the data meaningful.

Second comment:

> Testing for randomness might be the first test; sure would save ...

"Random" is too loose a term.  Are they "random" samples from a
uniform distribution, or "random" samples from a Gaussian distribution?
In either case is the distribution a real population, or a mathematical
model of a distribution function?

I don't want to sound like a flame, but testing for randomness is
ridiculous!  You *cannot* prove a set of data to be "random."  In fact
the key to some encryption schemes is to make a dataset appear "random"
to most simple minded tests.  This does not mean that there is no
information in the data.  It just means that the context of the
information is well hidden from such simple minded filters.

What you are saying when you say that you will test for randomness is
that you will test to see if the data is meaningful in any known
context.  Do you know all possible contexts?  Will you live long enough
to test for all of them?  What happens when the data is meaningful in
more than one context?
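[Editor's note: Ben's point can be demonstrated: a chi-square test of byte uniformity cannot tell genuinely patternless data from meaningful data hidden behind even a weak stream cipher. The sketch below is illustrative only; the LCG keystream stands in for a real cipher and is the editor's assumption, not something from the thread.]

```python
# Compare a chi-square uniformity statistic on plain text vs. the same
# text XORed with a pseudo-random keystream: the ciphertext "passes"
# a simple-minded randomness filter while still carrying information.

def chi_square_bytes(data: bytes) -> float:
    expected = len(data) / 256
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    return sum((c - expected) ** 2 / expected for c in counts)

def lcg_stream(seed: int, n: int):
    """Toy keystream from a linear congruential generator (illustrative)."""
    x = seed
    for _ in range(n):
        x = (1103515245 * x + 12345) % (1 << 31)
        yield (x >> 16) & 0xFF

plaintext = b"meaningful English text, highly non-uniform " * 200
ciphertext = bytes(p ^ k for p, k in zip(plaintext, lcg_stream(42, len(plaintext))))

# For 255 degrees of freedom, values far above ~310 reject uniformity
# at the 1% level; the plaintext fails spectacularly, while the
# ciphertext comes out far lower on the same filter.
print("plaintext  chi^2:", round(chi_square_bytes(plaintext)))
print("ciphertext chi^2:", round(chi_square_bytes(ciphertext)))
```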

---------

Benjamin Ellsworth
hplabs!hpldola!ben
(303) 590-5849

P.O. Box 617
Colorado Springs, CO 80901

2+2=4 (void where prohibited, regulated, or otherwise restricted by law)

dave@mimsy.UUCP (03/23/87)

In article <11160001@hpldolm.HP.COM>, ben@hpldolm.HP.COM (Benjamin Ellsworth) writes:
> My first comment on this whole discussion, as I understand it, is that
> it is silly.  We are being asked to find "the" meaning of some large
> file without any context for the file.  Is it text?  Is it integer
> data?  Is it floating point data?  Is it encrypted in any way? The 
> search for meaning in the absence of context is a waste of time. 

    Maybe I am at fault for inadequately describing the problem,
    but it is neither silly nor a waste of time.  Apart from these
two comments and the later one about test for randomness being
ridiculous, Ben's comments are helpful in further detailing the
possibilities.

> What is meaningful in one context is often not meaningful in another.
> However, sometimes, it is.  A file full of integer measurement data will
> usually be indistinguishable from a file of a bit-mapped color image.
> A bunch of integers is a bunch of integers (unless some *recognizable*
> context information is included).  If you take a group of integers and
> make a pretty picture with them, what will you do when I tell you that
> they were process measurements from a ball-bearing factory?  What will
> you do when you interpret a Mandelbrot image as a bad lot of wafers
> in an otherwise well controlled fab?
> I'm sure that you would like to say that you can't make a pretty
> picture with ball bearing data.  Perhaps not in every case, but I know
> of a gentleman who *sells* "art" generated from HP stock performance
> data.  He has given some stock data meaning in a new context.

    I wouldn't like to say you can't have multiple representations
of a set of data points.

    However, one man's "art" is simply another man's pictorial
    presentation of stock data.  (Particularly if the raw
stock data was not distorted by the artist.)  In fact, it might
be a useful presentation for certain kinds of trend analysis.

> The best response to this question was the one from  Mr. Adrian
> who suggested that you look for the context(s) that the file
> was used in.  If you can't find the correct context, you cannot
> ascertain the correct meaning.  If the data exists in a vacuum, you can
> choose whatever context that you wish and with enough massaging you
> can make the data meaningful.

    Certainly there is a pitfall in the analytic process; one may
    "discover" meaning that was not the intent of the creator of
the data.  So it goes, sometimes.

    "finding the correct context" and "finding the meaning" are the same thing!

> "Random" is too loose a term.  Are they "random" samples from a
> uniform distribution, or "random" samples from a Gaussian distribution?
> In either case is the distribution a real population, or a mathematical
> model of a distribution function?
> I don't want to sound like a flame, but testing for randomness is
> ridiculous!  You *cannot* prove a set of data to be "random."  In fact
> the key to some encryption schemes is to make a dataset appear "random"
> to most simple minded tests.  This does not mean that there is no
> information in the data.  It just means that the context of the
> information is well hidden from such simple minded filters.

    Hmm.  I think what I mean is that if the data set appears to be a Gaussian
distribution, then I'm not going to apply any other tests.
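[Editor's note: Dave's "appears to be a Gaussian distribution" check could itself be automated. One crude sketch, with arbitrary thresholds of the editor's choosing: compare sample skewness and excess kurtosis to the Gaussian values of zero.]

```python
import math
import random

def moments_look_gaussian(xs, tol=0.5):
    """Crude normality screen: Gaussian data has skewness ~0 and
    excess kurtosis ~0; flag anything far from both."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    skew = sum(((x - mean) / sd) ** 3 for x in xs) / n
    kurt = sum(((x - mean) / sd) ** 4 for x in xs) / n - 3.0
    return abs(skew) < tol and abs(kurt) < tol

random.seed(1)
gaussian = [random.gauss(0, 1) for _ in range(5000)]
uniform = [random.random() for _ in range(5000)]
print(moments_look_gaussian(gaussian))  # → True
print(moments_look_gaussian(uniform))   # → False (uniform kurtosis is -1.2)
```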

> What you are saying when you say that you will test for randomness is
> that you will test to see if the data is meaningful in any known
> context.  Do you know all possible contexts?  Will you live long enough
> to test for all of them?  What happens when the data is meaningful in
> more than one context?

    I can't possibly imagine all conceivable or theoretic contexts.
I can imagine too many to try.  I am looking for an analytic process
that is more efficient than enumerating all the context tests I can
imagine.  If multiple context tests yield "reasonable" representations,
I might just have to flip a coin or allow for all interpretations.

    I never said that the data has no context!  I simply said that I
don't know a priori what its context is.  It *is* the case that data
points can be analyzed in the absence of knowledge of the structure of
the function which produced them.  The object is to detect patterns,
if possible, and search for "meaningful" interpretations.

   Some of the discussion of this subject sounds like the participants
   are frustrated by these two facts:

1.  I *won't* live long enough to apply every possible context
test (discovery by enumeration).

and

2.  They don't know of any more efficient methodology than discovery
by enumeration, ergo the problem is silly or a waste of time.


-- 
       Dave Stoffel (703) 790-5357
       seismo!mimsy!dave
       dave@Mimsy.umd.edu
       Amber Research Group, Inc.