[comp.text] Public Domain Dictionary

ram@lynx.Berkeley.EDU (m.v.s. ramanath) (05/30/91)

There seems to be a serious need for a public domain dictionary.

Since dictionary publishers are unlikely to make their materials
available for free, I propose that we net folks create our own.

I assume that English words and their meanings are not
copyrightable (though the specific wording of a definition as published
in an extant dictionary may be). So as long as we come up with our own
definitions and don't lift material verbatim from published dictionaries,
we should be OK. This won't, of course, be a scholarly and definitive
effort, but it should be useful in an everyday context.

Here is the proposed plan of action:

I estimate that a dictionary needs to have about 180,000 words to be
reasonably useful. That means that if we get about 500 (actually 494)
definitions typed in per day, we'll be done in less than a year.

That means we need about 100 volunteers who each undertake to come up with
the definitions (in their own unique words) of 5 words per day. Or
50 who can do 10 per day.
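
The arithmetic above works out as follows (a throwaway sketch):

```python
# Back-of-the-envelope check of the schedule proposed above.
TARGET_WORDS = 180000
DAYS = 365

per_day = -(-TARGET_WORDS // DAYS)   # ceiling division: definitions needed per day
print(per_day)                       # -> 494

for volunteers in (100, 50):
    print(volunteers, "volunteers ->", per_day / volunteers, "definitions each per day")
```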

So how about it, folks? Volunteers?

If there is enough interest in this idea (either from potential users
of the dictionary or from potential volunteers) I'll flesh out the idea
some more. If there is a fatal flaw in the idea, I'd like to hear
about it.

Ram

======================================================================
Disclaimer: I speak for myself only and not for my employer
======================================================================
M.V.S. Ramanath                       |ram@imagen.com             or
QMS/Imagen                            |imagen!ram@sun.com         or
2650 San Tomas Expressway             |imagen!ram@decwrl.dec.com
Santa Clara, California, U.S.A. 95052 |
Phone: (408) 986-9400 (ext. 431)      |

lee@sq.sq.com (Liam R. E. Quin) (05/31/91)

ram@lynx.Berkeley.EDU (m.v.s. ramanath) writes:
>There seems to be a serious need for a public domain dictionary.

I've been thinking about exactly this issue for some time... and had even
got an article written (but not typed) along the same lines in the last
few days!

>That means we need about 100 volunteers who each undertake to come up with
>the definitions (in their own unique words) of 5 words per day. Or
>50 who can do 10 per day.

Of course, quality control and checks for regional variations are very
important.  Dictionary definitions are extraordinarily hard to write well.
Although perhaps we could do a plausible job, I doubt that Chambers or
Oxford or Webster need worry... :-)


I had half-planned the following:
* for each person writing definitions, there would be at least two people
  reading definitions with the ability to comment on them

* a writer is sent n randomly-chosen words (for example, 30 words taken from
  random usenet articles and other sources, subject to other checking)
  The software would keep a list of which words were sent to whom, of course;
  that's easy

* when the writer returns some or all of the words, the software sends the
  same number as were returned, crosses the received words off the list,
  and re-sends the un-returned ones.

* a writer can work at any rate, and can "refuse" to do some or all of the
  words.

* the words received are put on the list to be sent out to readers to check
  for typos, local variations (e.g. "momentarily" means different things to
  different people), and so on.

The same sort of thing would apply to words received from the reader-people.

I'm even prepared to work on such software (as well as type words...)

Much of the challenge is to automate enough that no one person has to see
500 words a day, as that would be (to say the least) a full-time job.


For fun, by the way, I have some dictionary entries already, but they are
mostly from seventeenth century dictionaries :-)


Liam

-- 
Liam Quin, lee@sq.com, SoftQuad, Toronto, +1 416 963 8337
the barefoot programmer

npn@cbnewsl.att.com (nils-peter.nelson) (06/01/91)

I've been holding on to these, not knowing what to
do with them. Thanks for the opportunity. Now, for these
never-before-published definitions:
A is a letter.
Di- is a two letter prefix.
Tri- is a three letter prefix.
Quad- is a four letter prefix.
Penta- is a five letter prefix.
Hexads are groups of 6 letters.
Heptads are groups of 7 letters.
Octogram is an 8 letter word.
Nonagraph is a sequence of 9 letters.
Decagraphs are sequences of 10 letters.
Undecagraph is a sequence of 11 letters.
Duodecagraph is a sequence of 12 letters.
Tredecagraphs are sequences of 13 letters.
Quattuordecads are groups of 14 letters.
Quindecagraphic describes 15 letter words.
Sexdecagrammaton is a 16 letter word.
Septendecagraphic describes 17 letter words.
Octodecagrammatons are 18 letter words.
Novemdecagrammatons are 19 letter words.

gtoal@tardis.computer-science.edinburgh.ac.uk (06/01/91)

In article <1991May29.235751.1362@imagen.com> ram@lynx.Berkeley.EDU (m.v.s. ramanath) writes:
>There seems to be a serious need for a public domain dictionary.

Not many people know this... but...

there already is one -- on uk's UUG archive at uk.ac.ic.doc - but I
don't know how people on internet can get at it.  I think it's about
5Mb - it's called DICT.Z (rather unoriginally :) ) and I've no idea
of its provenance, except that it is clearly American.  It includes
a phonetic representation of most of the words in it too, for those
of you interested in that sort of thing.

Maybe someone at ic.doc could say where it came from?

Graham

gaynor@yoko.rutgers.edu (Silver) (06/01/91)

I don't have anything useful to contribute to this discussion yet, but mainly
just wanted to acknowledge my interest and willingness to participate in a
minor role.

Regards, [Ag]

mh@awds23.imsd.contel.com (Mike Hoegeman) (06/02/91)

In article <1991May31.025805.24100@sq.sq.com> lee@sq.sq.com (Liam R. E. Quin) writes:
> ram@lynx.Berkeley.EDU (m.v.s. ramanath) writes:
> >There seems to be a serious need for a public domain dictionary.
>
> I've been thinking about exactly this issue for some time... and had even
> got an article written (but not typed) along the same lines in the last
> few days!
>
> Of course, quality control and checks for regional variations are very
> important.  Dictionary definitions are extraordinarily hard to write well.
> Although perhaps we could do a plausible job, I doubt that Chambers or
> Oxford or Webster need worry... :-)
>
> I had half-planned the following:
> * for each person writing definitions, there would be at least two people
>   reading definitions with the ability to comment on them

- - This sounds good to me. It might even be a good idea to make a newsgroup
and let the author post their definition; anyone who wishes to reply can
do so. The author can then, after a suitable period, revise their
definition. I think it would be nice to allow accompanying articles
(kind of like encyclopedia entries), for those who are ambitious.

> * a writer is sent n randomly-chosen words (for example, 30 words taken from
>   random usenet articles and other sources, subject to other checking)
>   The software would keep a list of which words were sent to whom, of
>   course; that's easy
>
> * when the writer returns some or all of the words, the software sends the
>   same number as were returned, crosses the received words off the
>   list, and re-sends the un-returned ones.

- - I can understand the reasons for issuing words randomly, but I would
enjoy the project much more if I could pick some of the words I was to
write entries for. Maybe have a policy that for every assigned word you
write an entry for, you can write one of your own choosing. This would
make the "word check out" process software more complicated, but worth
it in my opinion. It would also probably increase the quality of the
dictionary.

> * a writer can work at any rate, and can "refuse" to do some or all of
>   the words.
>
> * the words received are put on the list to be sent out to readers to
>   check for typos, local variations (e.g. momentarily means different
>   things to different people).
>
> The same sort of thing for words received from reader-people.
>
> I'm even prepared to work on such software (as well as type words...)

- - Me too...

> Much of the challenge is to automate enough that no one person has to
> see 500 words a day, as that would be (to say the least) a full-time
> job.
>
> For fun, by the way, I have some dictionary entries already, but they
> are mostly from seventeenth century dictionaries :-)

- - I'm one of those people who love reading obscure dictionary entries
and other interesting lexica. I think this could be a good piece of
reading material as well as a good desk reference. I would love to have
your 17th-century entries just as much as something more run of the
mill.

--------------------------------------------------------------------------
    mike hoegeman, 
    mh@awds.imsd.contel.com

jdc@naucse.cse.nau.edu (John Campbell) (06/03/91)

From article <1991Jun2.015314.5771@wlbr.imsd.contel.com>, by mh@awds23.imsd.contel.com (Mike Hoegeman):
: In article <1991May31.025805.24100@sq.sq.com> lee@sq.sq.com (Liam R. E. Quin) writes:
:  ram@lynx.Berkeley.EDU (m.v.s. ramanath) writes:
:  >There seems to be a serious need for a public domain dictionary.
:  
: - - This sounds good to me. It might even be a good idea make a newsgroup
: and let the author post their definition and let anyone who wishes
: reply do so. The author can then after a suitable period revise their
: definition. I think it would be nice to allow accompanying articles
: (kind of like encyclopedia entries). For those who are ambitious.
:

Isn't this how Douglas Adams wrote "The Hitch Hiker's Guide to the Galaxy"?
 
-- 
	John Campbell               jdc@naucse.cse.nau.edu
                                    CAMPBELL@NAUVAX.bitnet
	unix?  Sure send me a dozen, all different colors.

GONTER@awiwuw11.wu-wien.ac.at (Gerhard Gonter) (06/07/91)

There's clearly a need for a public domain online dictionary system,
and there are already a lot of such things available on various
ftp sites etc. Most of them are word lists; some of them contain
part-of-speech or other information as well.

Before we start compiling yet another one, we should sit back and
consider an encoding scheme which is flexible enough to meet a wide
variety of needs and applications. It's also a good idea to think
about a possible way to expand such an encoding scheme for applications
we don't even have an idea about now.

In the last few months I've collected all sorts of dictionary material
from virtually all over the world. This material was then used to create
an experimental lexicon which currently has more than 384,000 entries,
mostly English words. Many entries include part-of-speech information
and data from a so-called psycholinguistic database.
There are still some more files to be processed. My problem with
this lexicon is that it's already too large for my processing
capabilities (== hard disk storage space).

I'm very interested in a `public lexicon project' and I'm willing
to share the material that I've accumulated. I'm especially interested
in an encoding scheme powerful enough to meet whatever the needs
of interested users are.
- Such an encoding scheme would/should possibly be based on SGML.

Any comments/ideas?

p.s. Douglas Adams's Hitchhiker's Guide could be a nice metaphor
     for such a project.
p.p.s: what about comp.text.lexicon ?

best wishes, Gerhard Gonter

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Gerhard Gonter                                 <GONTER@AWIWUW11.BITNET>
Tel: +43/1/31336/4578                   <gonter@awiwuw11.wu-wien.ac.at>
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

jkp@cs.HUT.FI (Jyrki Kuoppala) (06/08/91)

In article <91156.144339GONTER@awiwuw11.wu-wien.ac.at>, GONTER@awiwuw11 (Gerhard Gonter) writes:
>Before we start compiling yet another one, we should sit back and
>consider an encoding scheme which is flexible enough to meet a wide
>variety of needs and applications. It's also a good idea to think
>about a possible way to expand such an encoding scheme for applications
>we don't even have an idea about now.

I think it'd be a good idea to provide support for several languages
in the format.  Then of course there'll be some trouble with the
character sets, etc, but the work would be useful all over the world,
and perhaps it could be made into a multi-language freely
distributable dictionary.  Maybe it could also be useful as a
dictionary for automatic translation tools (ah well, then the format
would be quite complicated).

//Jyrki

lee@sq.sq.com (Liam R. E. Quin) (06/09/91)

GONTER@awiwuw11.wu-wien.ac.at (Gerhard Gonter) writes:
> There's clearly a need for a public domain online dictionary system

yes!

> Before we start compiling yet another one, we should sit back and
> consider an encoding scheme which is flexible enough to meet a wide
> variety of needs and applications. It's also a good idea to think
> about a possible way to expand such an encoding scheme for applications
> we don't even have an idea about now.

I agree...
 
> I'm very interested in a `public lexicon project' and I'm willing
> to share the material that I've accumulated.  I'm especially interested
> in an encoding scheme, powerful enough to meet whatever the needs
> of interested users are.
> - Such an encoding scheme would/should possibly be based on SGML.

Well, that's a good idea too.

I think that the best way forward is to use a simple format that can easily
be transmitted over networks.  This implies a limited character set and
fairly short (<= 72 character) lines for many people.

The simple format can easily be converted to SGML.  It isn't easy to write
a DTD for a dictionary, so it is probably better not to try to do so at
first, although that doesn't preclude an SGML-style markup.  Another
alternative would be the Text Encoding Initiative DTD, but that's probably
more general than is appropriate.
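
As an illustration only (the field codes and tag names here are invented,
not a proposal), a simple line-oriented entry might be converted to
SGML-style markup like this:

```python
# Hypothetical simple interchange format: one field per short line,
# "CODE text".  Easy to mail over networks, easy to parse.
SAMPLE = """\
HW serendipity
POS n.
DEF the faculty of making happy discoveries by accident
"""

TAGS = {"HW": "hw", "POS": "pos", "DEF": "def"}  # field code -> SGML-ish tag

def to_sgml(text):
    """Convert the line-oriented format into SGML-style markup."""
    parts = ["<entry>"]
    for line in text.splitlines():
        if not line.strip():
            continue
        code, _, body = line.partition(" ")
        tag = TAGS[code]
        parts.append("  <%s>%s</%s>" % (tag, body, tag))
    parts.append("</entry>")
    return "\n".join(parts)

print(to_sgml(SAMPLE))
```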

> p.p.s: what about comp.text.lexicon ?

I think I'd rather see actual progress before a newsgroup, although I
suppose I could be persuaded to moderate such a thing.

Lee

-- 
Liam Quin, lee@sq.com, SoftQuad, Toronto, +1 416 963 8337
the barefoot programmer

lark@greylock.tivoli.com (Lar Kaufman) (06/11/91)

lee@sq.sq.com (Liam R. E. Quin) writes:
  GONTER@awiwuw11.wu-wien.ac.at (Gerhard Gonter) writes:
  > There's clearly a need for a public domain online dictionary system
  yes!
  
  > I'm very interested in a `public lexicon project' and I'm willing
  > to share the material that I've accumulated.  I'm especially interested
  > in an encoding scheme, powerful enough to meet whatever the needs
  > of interested users are.
  > - Such an encoding scheme would/should possibly be based on SGML.

  Well, that's a good idea too.

Agreed.

  I think that the best way forward is to use a simple format that can easily
  be transmitted over networks.  This implies a limited character set and
  fairly short (<= 72 character) lines for many people.

Agreed.

  The simple format can easily be converted to SGML.  It isn't easy writing
  a DTD for a dictionary, so it is probably better not to try and do so at
  first, although that doesn't preclude an SGML-style markup.  Another
  alternative would be the Text Encoding Initiative DTD, but that's probably
  more general than is appropriate.

It would be interesting to see what the guys at U. of Waterloo did with the 
online O.E.D. project.  I understand it is very SGML-like.

  > p.p.s: what about comp.text.lexicon ?

  I think I'd rather see actual progress before a newsgroup, although I
  suppose I could be persuaded to moderate such a thing.

That would be a noble thing.  Perhaps a newsletter would be an appropriate 
dissemination method, though?

Lar Kaufman            I would feel more optimistic about a bright future
(voice) 512-794-9070   for man if he spent less time proving that he can
(fax)   512-794-0623   outwit Nature and more time tasting her sweetness 
lark@tivoli.com        and respecting her seniority.  - E.B. White

drraymond@watdragon.waterloo.edu (Darrell Raymond) (06/12/91)

>It would be interesting to see what the guys at U. of Waterloo did with the 
>online O.E.D. project.  I understand it is very SGML-like.
 
  The online OED is marked up with tags that are reminiscent of SGML.  
However, there is no DTD for the OED, or for many of the other markup 
projects that Oxford University Press has undertaken. Many existing
dictionaries have too much variance in their structure to be completely
captured by SGML.  Even deciding what sort of information you want to
capture in your markup is a subject of some controversy.

----------------

  Maybe you guys could stand a few comments on your project in general.
Basically, you've got three things to worry about:

  (i)   coverage
  (ii)  correctness
  (iii) finishing

  Coverage implies you have to find a source of words that gives us some 
confidence that your dictionary is comprehensive enough for whatever 
purpose you have in mind.  This means more than just finding instances of
every word, it means finding instances of most of the senses of the words.
The strength of any dictionary is its underlying corpus, the collection
of language from which the examples are drawn.  In the case of the OED,
this means 8 to 10 million quotations sent in by volunteer readers.  In 
the case of the Collins COBUILD dictionary it's a special online corpus 
of about 40 million words.

  Correctness means that unless you put in place some mechanism that'll
give us confidence in how you obtained your results, no one will be using 
(or at least depending on) your dictionary.  One such mechanism is that
old scholarly tradition, accountability.  For example, the OED provides 
you with the quotes used to define the entry, as well as bibliographic 
information, so you can go and check the quote in the original source if
you like.  Thus you can hold the OED and its editors accountable for
the decisions they made, because you can look at the same evidence.

  Finishing means that you ought to be aware of the fact that many a 
dictionary project takes decades longer than the original editors 
forecast.  Dictionary-writing is not a part-time activity. 

  Some comments on statements made in various postings:

>There seems to be a serious need for a public domain dictionary.

  My first question is, what for? I admit I didn't see the first posting 
in this thread.  Is it really the definitions you want, or just a word list 
with correct spellings and parts of speech (which would be fine for a lot 
of automatic uses)?  If you actually want to write a dictionary from 
scratch, good luck, you'll be at it a long time.  If it's only a word 
list that you want, you stand a better chance of completing.

>That means we need about 100 volunteers who each undertake to come up with
>the definitions (in their own unique words) of 5 words per day. Or
>50 who can do 10 per day.

  Goodness gracious.  10 words per day?  Just sit down and write me up
a definition of the word "good".  Make sure to cover as many senses and 
usages as you can think of.  Go check a couple of dictionaries and see 
how many senses you missed.  If it takes you less than an hour to do 
a good job I'd be surprised.  Now multiply that by 10.

  Just as you cannot get twice the software production by doubling the
number of programmers, you cannot get twice the dictionary by doubling
the number of volunteers who write definitions.

>Of course, quality control and checks for regional variations are very
>important.  

  Whoops, add to that hour per word all the checks you're going to do for 
regional variations and quality control.  Who has the final word on the 
quality of a definition, anyway?  

>Dictionary definitions are extraordinarily hard to write well.

  But you plan to do 5 to 10 a day?

>* a writer is sent n randomly-chosen words (for example, 30 words taken from
>  random usenet articles and other sources, subject to other checking)

  Usenet is not exactly what I would call a broadly based source of 
words (especially if you want them spelled correctly).

>- - This sounds good to me. It might even be a good idea to make a newsgroup
>and let the author post their definition and let anyone who wishes
>reply do so. The author can then after a suitable period revise their
>definition. 

  When there are disputes, who is the final authority?  Since the author
is basically chosen at random, he or she probably has no more claim to
being the final authority than anyone else...

>I think it would be nice to allow accompanying articles
>(kind of like encyclopedia entries). For those who are ambitious.

  It would be nice - who's going to check them for correctness? What if 
some of them are sexist or racist?  Who decides what is permitted and what 
isn't?  Are the people who decide such things then exposing themselves to 
liability for lawsuits?
 
>- - I can understand the reasons for issuing words randomly but I would
>enjoy the project much more if I could pick some of the words I was to
>write entries for. 

  No doubt.  Who decides who gets the most popular words?  

----------------
  
  I'm not trying to throw a wet blanket on this project.   But imagine
if a bunch of lexicographers got together to rewrite Unix on a part-time
basis ('cause we need a public domain one, don'tcha know)....

enag@ifi.uio.no (Erik Naggum) (06/13/91)

Lar Kaufman <lark@greylock.tivoli.com> writes:
| 
|   It would be interesting to see what the guys at U. of Waterloo did
|   with the online O.E.D. project.  I understand it is very SGML-like.

They have used a portion of SGML's syntax, which, I'm sorry to say,
does not make it SGML-conformant.  As I heard the story, there were too
many inconsistencies in the original material to try to make a DTD for
OED2.

What I've seen of the OED2 is not pretty, so they obviously had a very
hard job figuring out how to encode it and actually do the encoding.
We're creating a dictionary from scratch (aren't we? :-), so we could,
perhaps, be better at consistency...

Let's try to look at how other people did their dictionary entries,
and what we would like to include, in what order, then test that by
submitting numerous entries to these constraints before continuing.
Changes in the structure is going to make a lot of people unhappy and
create a lot of unnecessary work.  (Converting between structures is
possible, but generally difficult.)

</Erik>
--
Erik Naggum             Professional Programmer            +47-2-836-863
Naggum Software             Electronic Text             <ERIK@NAGGUM.NO>
0118 OSLO, NORWAY       Computer Communications        <enag@ifi.uio.no>

GONTER@awiwuw11.wu-wien.ac.at (Gerhard Gonter) (06/14/91)

In article <1991Jun8.145123.23369@nntp.hut.fi>, jkp@cs.HUT.FI
(Jyrki Kuoppala) writes:

> I think it'd be a good idea to provide support for several languages
> in the format.  Then of course there'll be some trouble with the
> character sets, etc, but the work would be useful all over the world,
> and perhaps it could be made into a multi-language freely
> distributable dictionary.  Maybe it could also be useful as a
> dictionary for automatic translation tools (ah well, then the format
> would be quite complicated).
>
> //Jyrki

As a native German speaker I fully agree with you.

best wishes, Gerhard Gonter

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Gerhard Gonter                                 <GONTER@AWIWUW11.BITNET>
Tel: +43/1/31336/4578                   <gonter@awiwuw11.wu-wien.ac.at>

tbray@watsol.waterloo.edu (Tim Bray) (06/24/91)

enag@ifi.uio.no (Erik Naggum) writes:
 They have used a portion of SGML's syntax, which, I'm sorry to say,
 does not make it SGML-conformant.  

It's not clear to me that an SGML-conformant dictionary is either necessary
or desirable.  A dictionary should be a small model of a human language.  
Not even SGML's strongest partisans claim for it an ability to model natural
language.

 As I heard the story, there were too
 many inconsistencies in the original material to try to make a DTD for
 OED2....
 What I've seen of the OED2 is not pretty, so they obviously had a very
 hard job figuring out how to encode it and actually do the encoding.

Indeed.  Frank Tompa of the New OED project at Waterloo, who has had a lot
of experience with online dictionaries, in co-operation with Bob Amsler, then
of Bellcore, now of Mitre, put in a lot of time and came up with a proposed
SGML def for dictionaries.  But it was tough, and even Tompa and Amsler
were left somewhat unsatisfied that they had covered the bases.

I have to disagree on the "not pretty" part.  Challenging, complex, somewhat
irregular, yes, all of those are true.  But this is no uglier than the 
English language that the OED is trying to describe.

Tim Bray, U of W Centre for the New OED -and- Open Text Systems

enag@ifi.uio.no (Erik Naggum) (06/25/91)

Tim Bray <tbray@watsol.waterloo.edu> writes:
|
|   It's not clear to me that an SGML-conformant dictionary is either
|   necessary or desirable.  A dictionary should be a small model of a
|   human language.  Not even SGML's strongest partisans claim for it
|   an ability to model natural language.

I must have missed something really crucial.  I have always thought
that a dictionary _entry_ is a structured unit of information in a
dictionary, containing other, smaller units of information, such as
word class, etymology, pronunciation, inflection, and a number of
definitions.  What is the relevance of "natural languages" in this?

SGML is a language in which you express the structure of information,
among other things, and _all_ information has _some_ structure;
otherwise it's noise.  SGML is suitable for expressing any kind of
structure which has a hierarchical nature, i.e. every element is
contained in toto in another element.  There are some cases where this
is not true, and SGML fails to handle those cases in the simplest way
with attribute-less tags, yet it can be done with tags and time-space
coordinates and reference points to describe the start and stop of any
event, including overlapping spatial elements.

I don't think you have a good grip on what SGML is, but you're not
alone.  The only wish I have is that those who have nth-hand
information and knowledge of SGML please try to verify it, especially
as n approaches infinity.

|   Indeed.  Frank Tompa of the New OED project at Waterloo, who has
|   had a lot of experience with online dictionaries, in co-operation
|   with Bob Amsler, then of Bellcore, now of Mitre, put in a lot of
|   time and came up with a proposed SGML def for dictionaries.  But
|   it was tough, and even Tompa and Amsler were left somewhat
|   unsatisfied that they had covered the bases.

This is not totally relevant to the OED2 project.  The OED2 project
had some significant real-life constraints to work with, such as an
existing dictionary.  It's very unlikely that a very large number of
dictionary entries will be consistent with any given structure, unless
that structure is so large it becomes chaotic and useless.  If you sat
down to work out a DTD ("SGML def"?) for a dictionary, you would spend
a large amount of time doing so, instead of randomly stuffing things
into dictionary entries with only intuitive guidelines for
structuring, so as not to confuse the poor user.  Document analysis
and design are truly _hard_ tasks, and require a lot more than people
think.  The complexity of the task of course grows with the complexity
of the document under analysis.  That doesn't mean it can't be done,
which you imply.  Once it's defined, it should also capture the way we
can best retrieve information from a given instance, such as a
dictionary entry.  Of course this is hard.  What did you expect?

|   I have to disagree on the "not pretty" part.  Challenging,
|   complex, somewhat irregular, yes, all of those are true.  But this
|   is no uglier than the English language that the OED is trying to
|   describe.

I don't understand how you can put both a description and the object
of a description into one big bag and get anything useful out of it.
To me, it looks like you suffer from a severe layering confusion,
wherein an abstraction (description) of an entity can be no different
in complexity from the entity itself.  This is a very naive view.
It's also remarkably counter-productive, as the main objective of
abstraction is to reduce complexity to a level where humans can
comfortably deal with it.  I'm utterly amazed that this comes from one
who has worked with the OED2 dictionary project.

A description, or structure specification, or whatever, will
necessarily have to extract the essential elements of what is
described or specified.  Otherwise it's useless, as one could just
turn to the described element itself and get a better idea of it.
"Essential", of course, requires (human) intelligence and creativity
in discovering what is and is not essential.  The whole task of
writing a definition is centered around discarding the unimportant.  A
document type likewise requires that one extract the essentials,
according to one or a few views, which have to be known explicitly by
the designer.

It so happens that SGML is a language in which one can express the
interrelationships between elements of a hierarchical structure in
such a way as to produce a consistent type, of which any given
dictionary, dictionary entry, and on down, are instances.
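
For concreteness, such a type might be sketched as a DTD fragment along
these lines (element names invented for illustration, not a worked-out
design):

```sgml
<!ELEMENT dictionary - - (entry+)                        >
<!ELEMENT entry      - - (hw, pron?, pos?, etym?, sense+)>
<!ELEMENT sense      - - (def, example*)                 >
<!ELEMENT (hw | pron | pos | etym
          | def | example)  - - (#PCDATA)                >
```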

I don't understand how you can claim that SGML can't model natural
languages.  It wasn't intended to, and the question is completely
irrelevant to the structuring of dictionary entries.  It's like
claiming that TeX can't model emotions, or that the programming
language C can't model sexual experiences.

</Erik>
--
Erik Naggum             Professional Programmer            +47-2-836-863
Naggum Software             Electronic Text             <erik@naggum.no>
0118 OSLO, NORWAY       Computer Communications        <enag@ifi.uio.no>