[comp.text.desktop] document analysis

chuq%plaid@Sun.COM (Chuq Von Rospach) (06/01/87)

Date: Fri, 29 May 87 09:56:04 PDT
From: mlwh@sphinx (Martin Hall)

I would be interested in finding out about document analysis.  I mean
this in as general a sense as you want to take it.  Any pointers
would be appreciated.

Some of the areas that I would be particularly interested in are:
	Effects of different typestyles, page layouts, etc. on the reader
	Analysis of textual content
		-- how to analyze the content of a document
		-- length of words/sentences/paragraphs and the effect
			on readability

Basically, I would like any information that concerns the readability
and understanding of a document.
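
As a rough illustration of the word/sentence/paragraph measurements
mentioned above, here is a minimal Python sketch.  The function names and
the sample text are made up for illustration, the sentence splitter is
deliberately crude, and the Automated Readability Index (ARI) is only one
of several published readability formulas, picked here because it needs
nothing but letters per word and words per sentence.

```python
import re

SAMPLE_TEXT = "The cat sat. The dog ran off.\n\nIt was late."  # invented sample


def text_stats(text):
    """Count paragraphs, sentences and words, and compute average lengths."""
    # Paragraphs are taken to be separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Crude sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # A "word" here is a run of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    n_words, n_sents = len(words), len(sentences)
    return {
        "paragraphs": len(paragraphs),
        "sentences": n_sents,
        "words": n_words,
        "words_per_sentence": n_words / n_sents if n_sents else 0.0,
        "letters_per_word": sum(len(w) for w in words) / n_words if n_words else 0.0,
    }


def automated_readability_index(stats):
    """ARI = 4.71 * (letters/word) + 0.5 * (words/sentence) - 21.43."""
    return 4.71 * stats["letters_per_word"] + 0.5 * stats["words_per_sentence"] - 21.43


if __name__ == "__main__":
    stats = text_stats(SAMPLE_TEXT)
    print(stats)
    print(automated_readability_index(stats))
```

Higher ARI values suggest harder text.  Numbers like these say nothing
about layout or typestyle effects, which need studies with real readers.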

----Martin L. W. Hall---- Sun Microsystems 
HASA member in good standing
{allegra | hplabs}!sun!mlwh@sphinx or mlwh@sun.COM

----------------------------------------
Submissions to:   desktop%plaid@sun.com -OR- sun!plaid!desktop
Administrivia to: desktop-request%plaid@sun.com -OR- sun!plaid!desktop-request
Paths:  {ihnp4,decwrl,hplabs,seismo,ucbvax}!sun
Chuq Von Rospach	chuq@sun.COM		Delphi: CHUQ

Now, where did my ex-wife put my Fairy Dust?

chuq@plaid.UUCP (06/04/87)

Date: Thu, 4 Jun 87 13:38:38+0300
From: nsc!nsta!nsta.UUCP!iddo (Iddo Carmon /NSTA (052)-522-267)
Organization: National Semiconductor (Israel) Ltd.

>From: mlwh@sphinx (Martin Hall)
>I would be interested in finding out about document analysis.  I mean
>this in as general a sense as you want to take it.  Any pointers
>would be appreciated.

My view is that this kind of activity is best handled by a proper mix of
human/machine interaction.  Consider the news system as an example: here
you have a massive amount of information to choose from, but you are still
able to handle it efficiently and select the things that are relevant to
you by means of software utilities of varying degrees of sophistication.

However, these utilities all rely on a set of conventions for putting
things in header lines that later enable the system to locate articles in
the newsgroup hierarchy, and on the intelligence of a human poster who
selects the proper newsgroups.  Also, the structure of the newsgroup
hierarchy is developed by humans according to their interests and is a key
factor in the ease of selecting specific information.
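
To make the header-line convention concrete, here is a small Python
sketch that reads the Newsgroups header of an article and tests whether
the article falls under a given branch of the hierarchy.  The sample
article, its addresses and the function names are invented for
illustration.  Python's standard email parser is used only because news
headers follow the same "Name: value" layout as mail headers; this is not
how any particular newsreader is implemented.

```python
from email import message_from_string

# Invented sample article; only the header layout matters here.
SAMPLE_ARTICLE = """\
From: someone@example.com (Some Poster)
Newsgroups: comp.text.desktop,comp.misc
Subject: document analysis

Body text goes here.
"""


def newsgroup_paths(article_text):
    """Return each newsgroup of the article as a list of hierarchy levels."""
    msg = message_from_string(article_text)
    groups = [g.strip() for g in msg.get("Newsgroups", "").split(",") if g.strip()]
    return [g.split(".") for g in groups]


def in_hierarchy(article_text, prefix):
    """True if any newsgroup of the article lies under the given prefix."""
    want = prefix.split(".")
    # Compare whole levels, so "comp.tex" does not match "comp.text".
    return any(path[:len(want)] == want for path in newsgroup_paths(article_text))


if __name__ == "__main__":
    print(newsgroup_paths(SAMPLE_ARTICLE))
    print(in_hierarchy(SAMPLE_ARTICLE, "comp.text"))
```

The software only has to split a string.  The real work was done by the
poster who chose the groups and by the people who laid out the hierarchy.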

Instead of treating a document as a 1-dimensional stream of characters and
trying to extract meaning from that, I'd like to see some common
general-purpose high-level 'document-programming' language evolve, one that
would be used to annotate the text and would then enable automatic parsing
of the document into sections and threads of reasoning, selection of pieces
by going down a subject menu tree, etc.  Such a convention may make it
possible to scan/archive documents according to their contents in numerous
ways, without requiring a "natural language understanding superexpert
system".
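
A toy version of such an annotation convention can be sketched in Python.
Everything below is hypothetical: the ".section" and ".subject" directives,
the sample document and the function names are invented for illustration
and do not correspond to any existing markup language.  The point is only
that once the author has annotated the text, simple line-by-line parsing
is enough to split it into sections and to build a subject index.

```python
from collections import defaultdict

# Invented sample using two made-up directives, ".section" and ".subject".
SAMPLE_DOC = """\
.section Typefaces
.subject layout, readability
Serif and sans-serif faces are compared here.
.section Sentence length
.subject readability
Short sentences are discussed here.
"""


def parse_annotated(text):
    """Split annotated text into sections with title, subjects and body."""
    sections = []
    current = None
    for line in text.splitlines():
        if line.startswith(".section "):
            current = {"title": line[len(".section "):].strip(),
                       "subjects": [], "body": []}
            sections.append(current)
        elif line.startswith(".subject ") and current is not None:
            names = line[len(".subject "):].split(",")
            current["subjects"] += [n.strip() for n in names if n.strip()]
        elif current is not None:
            current["body"].append(line)
        # Lines before the first ".section" are ignored in this sketch.
    return sections


def subject_index(sections):
    """Map each subject keyword to the titles of the sections that carry it."""
    index = defaultdict(list)
    for section in sections:
        for subject in section["subjects"]:
            index[subject].append(section["title"])
    return dict(index)


if __name__ == "__main__":
    print(subject_index(parse_annotated(SAMPLE_DOC)))
```

The index here is flat.  A subject menu tree could be had by allowing
keywords with several levels and splitting them the way newsgroup names
are split.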

--
Iddo Carmon
Architecture Dept.                        Tel:  +972-52-522-267
National Semiconductor (Israel) Ltd.      uucp: ...!nsc!nsta!iddo
P.O.B. 3007, Herzlia B. 46104, Israel           {hplabs,pyramid,sun,decwrl}

chuq%plaid@Sun.COM (Chuq Von Rospach) (06/11/87)

From: ames!hoptoad!localhost!killer!robm
Date: Mon, 8 Jun 87 19:41:31 CDT

> Date: Thu, 4 Jun 87 13:38:38+0300
> From: nsc!nsta!nsta.UUCP!iddo (Iddo Carmon /NSTA (052)-522-267)
> Organization: National Semiconductor (Israel) Ltd.
> 
> >From: mlwh@sphinx (Martin Hall)
> >I would be interested in finding out about document analysis.  I mean
> >this in as general a sense as you want to take it.  Any pointers
> >would be appreciated.
> 
> .......
> 
> Instead of treating a document as a 1-dimensional stream of characters and
> trying to extract meaning from that, I'd like to see some common
> general-purpose high-level 'document-programming' language evolve, one that
> would be used to annotate the text and would then enable automatic parsing
> of the document into sections and threads of reasoning, selection of pieces
> by going down a subject menu tree, etc.  Such a convention may make it
> possible to scan/archive documents according to their contents in numerous
> ways, without requiring a "natural language understanding superexpert
> system".
> ........

You might take a look at the _Chicago Guide to Preparing Electronic
Manuscripts_, University of Chicago Press, 1987.  It contains a generic
page markup language.  

Rob Moser ---   ihnp4!killer!robm



chuq%plaid@Sun.COM (Chuq Von Rospach) (06/17/87)

From: David Boyes <dboyes@uoregon%tektronix.tek.com>
Date: 17 Jun 87 02:51:05 GMT
Organization: University of Oregon, Computer Science, Eugene OR

>> Instead of treating a document as a 1-dimensional stream of characters
>> ...  I'd like to see some common general-purpose high-level
>>  'document-programming' language evolving, ...
>
>     There is the ANSI Standard Generalized Markup Language, which allows
>you to  describe a document in a way which relates to its content rather 

You also might want to check out the University of Waterloo's SCRIPT/GML
or IBM's DCF/GML implementation. Both are fairly good, if a bit artificial.
The facilities that they define could easily be ported to just about any
text formatter -- I'm going to see what can be done about making a
TeX-GML on that basis...someday when I have time...8-).
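
To give a feel for what porting GML tags to another formatter might
involve, here is a minimal Python sketch that rewrites a few GML-style
tags as LaTeX commands.  It is an assumption-laden toy, not an actual
TeX-GML: only five tags (:h1. :p. :ul. :li. :eul.) are handled, the choice
of LaTeX commands for each is made up for this example, the sample input
is invented, and text is passed through without escaping TeX's special
characters.

```python
import re

# A tag is taken to be ':name.' at the start of a line, followed by text.
TAG_RE = re.compile(r"^:(\w+)\.(.*)$")

# Invented sample input using a handful of GML-style tags.
SAMPLE_GML = """\
:h1.Readability
:p.Short sentences help.
:ul.
:li.word length
:li.sentence length
:eul.
"""


def gml_to_latex(source):
    """Translate a few GML-style tags to LaTeX (hypothetical mapping)."""
    out = []
    for line in source.splitlines():
        match = TAG_RE.match(line)
        if not match:
            out.append(line)  # untagged continuation text passes through
            continue
        tag, rest = match.group(1).lower(), match.group(2).strip()
        if tag == "h1":
            out.append("\\section{%s}" % rest)
        elif tag == "p":
            out.append("")  # a blank line starts a new paragraph in TeX
            if rest:
                out.append(rest)
        elif tag == "ul":
            out.append("\\begin{itemize}")
        elif tag == "li":
            out.append("\\item " + rest)
        elif tag == "eul":
            out.append("\\end{itemize}")
        else:
            out.append("% unhandled tag: " + line)  # kept as a TeX comment
    return "\n".join(out)


if __name__ == "__main__":
    print(gml_to_latex(SAMPLE_GML))
```

Because the tags describe what each piece of text is rather than how it
should look, the same input could be fed to a different back end by
swapping only the right-hand side of each branch.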

-- 
David Boyes                   ARPA: 556%OREGON1.BITNET@WISCVM.WISC.EDU
Systems Division              BITNET: 556@OREGON1
University of Oregon Computing Center   UUCP: dboyes@uoregon.UUCP
