[mod.ai] Dear Abby, Analysis of unknown data.

zs01#@ANDREW.CMU.EDU (Zalman Stern) (03/08/87)

Dear Abby:

Explanation of results in an expert system should be viewed as a method of
communication between intelligent entities. Conventional groups of human
experts tend to fail very badly when nobody tells anybody else what is going
on. If you expect anything different to happen with artificial experts, you
are deluding yourself. I think explanation facilities must be designed into
the standard interface a program uses to communicate with humans and other
expert systems. Of course, telling too much tends to bore people too... Why
not view AI as a chance to fix some of the bugs in human communication?

Analysis of unknown data:

I guess the idea here is to come up with an expert version of the UNIX file
program. (file is a program which is executed like "file core" and it tells
you "core:	core file from 'loseprog'") The file program is written
using very ad hoc techniques. It knows about all the magic numbers commonly
used in a UNIX system, about keywords for common languages, patterns that
occur in various kinds of text... As you can guess, it assumes a lot.
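
To make that concrete, the magic-number half of what file does can be
sketched in a few lines; the table entries below are illustrative, not a
real catalog:

    import sys

    # A few illustrative (offset, magic bytes, description) entries.
    # A real table would catalog every magic number used on the system.
    MAGIC = [
        (0, b"\x7fELF", "ELF executable"),
        (0, b"%PDF-", "PDF document"),
        (0, b"\x1f\x8b", "gzip compressed data"),
    ]

    def classify(path):
        with open(path, "rb") as f:
            head = f.read(64)
        for offset, magic, description in MAGIC:
            if head[offset:offset + len(magic)] == magic:
                return description
        return "data"    # file's traditional answer of last resort

    if __name__ == "__main__":
        print(sys.argv[1] + ": " + classify(sys.argv[1]))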

One of the first things to realize is that there are files for which your
system is not going to be able to come up with any useful information. Try
feeding it 156MB of perfectly random numbers, for example. One must also
figure out what kind of explanations this system is going to give. In the
organization category, do you want explanations of the form "The file is
columnized data." or "This file is in the proper format of a doctoral
dissertation in Computer Science at Carnegie Mellon University"?

Once the program has figured out what the file is, it can easily extract the
"representation, organization, and content" of the file using information
from its knowledge base. So the problem becomes one of designing a pattern
matcher and coming up with a knowledge base that knows about all kinds of
files. Optionally, the program could try to deduce all the information
desired from the file itself, but I think that would be much more difficult
to do.
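
One plausible shape for such a knowledge base, with field names that are
pure invention on my part, is a recognizer paired with the metadata it
licenses; matching the pattern is what lets the program "easily extract"
the rest:

    import re
    from dataclasses import dataclass

    @dataclass
    class FileKind:
        # Hypothetical knowledge-base entry: a pattern plus the
        # representation, organization, and content it implies.
        name: str
        pattern: bytes        # regex applied to the head of the file
        representation: str
        organization: str
        content: str

    KNOWLEDGE_BASE = [
        FileKind("C source", rb"#include\s*<", "ASCII text",
                 "line-oriented", "C program source"),
        FileKind("troff input", rb"^\.(TH|SH) ", "ASCII text",
                 "line-oriented", "formatted document source"),
    ]

    def match(head: bytes):
        for kind in KNOWLEDGE_BASE:
            if re.search(kind.pattern, head, re.MULTILINE):
                return kind
        return None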

Here is one way to approach this problem:

Design a number of representations of a file. Examples of these are:

	- ASCII text in line format (i.e., the way your favorite editor
	  displays it).
	- A numerical dump of the file.
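
The numerical dump is the easy one; a sketch in the style of od(1):

    def dump(data: bytes, width: int = 16):
        # Print each row as octal offset, hex bytes, printable ASCII.
        for i in range(0, len(data), width):
            row = data[i:i + width]
            hexpart = " ".join("%02x" % b for b in row)
            text = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
            print(f"{i:08o}  {hexpart:<{width * 3}} {text}")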

Also, there are many formats specific to certain programs. For these, the
representation is derived from firing up the appropriate program on the file.
For example, if you are trying to classify a system executable, you will want
to run the system debugger (or disassembler) on the file. There is an
assumption here that files don't exist in a vacuum. If they did, they would
be useless.
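
Mechanically, a derived representation is just the captured output of the
right tool; here strings(1) stands in for whatever debugger or
disassembler the classifier picks:

    import subprocess

    def derived_representation(tool: str, path: str) -> str:
        # Run an external tool (strings, nm, a disassembler...) on the
        # file and treat its output as another view of the same file.
        result = subprocess.run([tool, path], capture_output=True,
                                text=True, check=False)
        return result.stdout

    # e.g. derived_representation("strings", "/bin/ls")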

Now that this is done, you are ready to start building a knowledge base. To
do this you want to have a driver program that allows an expert to examine
files and enter information into the system. The driver program will need
enough "intelligence" to ask the expert why he did certain things. Of course,
you can have humans analyze the experts' answers and encode them
appropriately. Then just get a bunch of experts and a large file system and
let them hack at it...
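
Stripped to its skeleton, that driver is a short loop; deciding when and
what to ask the expert is the hard part this sketch punts on:

    def acquire(path: str, knowledge_base: list):
        # Show the expert the file, record a classification, and always
        # ask "why" so the rationale gets captured too.
        with open(path, "rb") as f:
            head = f.read(256)
        print(repr(head))     # or the dump sketch from earlier
        label = input(f"What kind of file is {path}? ")
        reason = input("Why do you say that? ")
        knowledge_base.append({"path": path, "label": label,
                               "reason": reason})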

I think this may even be doable, but I doubt it would be worthwhile. 

Have I made too many assumptions? Is this general enough? Is this what you
consider automated?

Sincerely,
Zalman Stern
ARPA: zs01#@andrew.cmu.edu

Disclaimer: I am not involved in any kind of AI research and never have been.

dave@MIMSY.UMD.EDU.UUCP (03/09/87)

>I guess the idea here is to come up with an expert version of the UNIX file
>program.

    The problem with the `file' approach is that it assumes one
    already has knowledge of the "files" he is attacking.  So, this
    technique might become more and more useful, but only "might".

>One of the first things to realize is that there are files for
>which your system is not going to be able to come up with any
>useful information. Try feeding it 156MB of perfectly random
>numbers for example.

    Testing for randomness might be the first test; sure would save
    a lot of subsequent computing if it were random.
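
    One cheap screen along those lines is byte entropy: random data sits
    near 8 bits per byte, while text and most structured formats sit well
    below it (compressed data also scores high, so it is only a screen).
    A sketch, with an arbitrary cutoff:

        import math
        from collections import Counter

        def byte_entropy(data: bytes) -> float:
            # Shannon entropy in bits per byte; 8.0 is the maximum.
            if not data:
                return 0.0
            n = len(data)
            return -sum(c / n * math.log2(c / n)
                        for c in Counter(data).values())

        def looks_random(data: bytes) -> bool:
            return byte_entropy(data) > 7.9   # cutoff is a guess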

>files. Optionally, the program could try and deduce all the information
>desired from the file, but I think that would be much more difficult to do.

    Yep.  It would be nice to take a goal-driven, top-down approach,
    but sometimes data-driven inference, e.g., auto-correlation,
    is what there is.
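
    For what it's worth, the auto-correlation test is also cheap to
    compute; a spike at some lag k is good evidence of fixed-length
    records of size k (the function name here is my own):

        def autocorrelation(data: bytes, lag: int) -> float:
            # Normalized lag-k autocorrelation of the byte values:
            # near zero for random data, spiking at the record size
            # for fixed-length records.
            if lag >= len(data):
                return 0.0
            n = len(data) - lag
            mean = sum(data) / len(data)
            var = sum((b - mean) ** 2 for b in data) / len(data)
            if var == 0:
                return 0.0
            cov = sum((data[i] - mean) * (data[i + lag] - mean)
                      for i in range(n)) / n
            return cov / var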

>representation is derived from firing up the appropriate program on the file.
>For example, if you are trying to classify a system executable, you will want
>to run the system debugger (or disassembler) on the file. There is an
>assumption here that files don't exist in a vacuum. If they did, they would
>be useless.

   Their uselessness, and whether they exist in a vacuum at all, are
   both assumptions.

-- 
       Dave Stoffel (703) 790-5357
       seismo!mimsy!dave
       dave@Mimsy.umd.edu
       Amber Research Group, Inc.