[comp.std.internat] rendering v. processing

rschwartz@OFFICE.WANG.COM (R. Schwartz@Wang R&D Net) (05/01/91)

in <message id fiendishly hidden by brain-damaged software... sorry>
djbpitt@vms.cis.pitt.edu (Professor David J. Birnbaum) writes:

> I would also be interested in seeing some discussion of the issues
> raised in the posting to which I am responding.  But it is misleading
> to caricature the question of adequacy of coverage as quibbling over
> the presence or absence of individual characters.  The real issue is
> whether a multilingual character set should be able to support texts
> such as Professor Steensland's monograph.

> This issue is not a squabble about a single missing character with
> a funny name.  It is about how the issue of separable diacritics
> affects not only programming concerns; this issue affects in a
> substantial way the adequacy of coverage of the character set.  And
> the latter must be paramount; we first decide what must be represented
> and only then can we evaluate the ramifications of one or another
> system of representation.  And what must be represented is written culture,
> not just vendors' databases of clients.

These arguments clearly represent several viewpoints with which I must
respectfully disagree:

A character set, like any data representation, is a combination of codes and
processing rules that together comprise an abstraction of properties of
real-world entities.  Every abstraction, no matter how well constructed, loses
some information.  E.g., no program on any machine will ever truly be able to
represent *all* integers.  To state that "adequacy of coverage... must be
paramount" is reflective of only one position in a spectrum of user
requirements for character sets.  Show me a machine that represents "all"
integers, and I'll show it an integer that it can't represent.  We
therefore compromise in our abstraction, and settle for a representation
that considers adequacy of coverage for the *vast majority* of the foreseen
users to be paramount.
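
A concrete aside, in C (the example and its names are mine alone): any
fixed-width abstraction of the integers has a largest value, and the "next"
integer is simply outside the abstraction.

    #include <stdio.h>
    #include <limits.h>

    int main(void)
    {
        unsigned int biggest = UINT_MAX;   /* largest value the abstraction covers */
        unsigned int next = biggest + 1U;  /* well-defined in C: wraps around to 0 */

        printf("largest representable: %u\n", biggest);
        printf("largest + 1 becomes:   %u\n", next);
        return 0;
    }

The same applies, mutatis mutandis, to any fixed-width character code.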

I don't believe that, as worthy as such efforts are, works like Professor
Steensland's are the ones that govern the design of a character set.  Please
note that I did not say a character set *should not* provide coverage for
such a work.  I merely say that critical design decisions that affect the
methods by which *all* characters in a standard are represented should not
be made based solely (or paramountly -- I know there are linguists reading
this group, so if that isn't a word... mea culpa) on the "high-end"
requirements.  I believe that the processing needs of the 99.99+% of
users whose requirements fall short of this high end of the spectrum
should play a larger role in the design of a new character set.

For some I'm sure this is sad but true: someone is going to pay for the
development and implementation of any new character set by any vendor.  So
please don't blame anyone for putting the written culture of the client
base ahead of the more challenging written culture of a few additional
"high-end" customers.  I speak from experience on this matter: my company
was a pioneer in bringing text processing to the mass market, with solid
customer bases in business, government, academia, etc. in the US and
abroad; yet for years my advocacy of larger character repertoires for
our customers went largely unheard, because the work involved was too large
and the number of high-end customers willing to pay for the effort was too
small.  Colleagues with other vendors have told me that the same is true
in their situations.  (Please... don't insult me with "and look where your
company is today..."  Our character set coverage issues pale in comparison to
the things that have put us where we are now.)

I believe that the way to compromise on this question is to recognize two
distinctly different classes of need: one is for *rendering* an extremely
wide variety of written symbols, and the other is for *processing* a less
extensive set.  Rendering Professor Steensland's work is a substantially
different problem from processing it.  I would argue that even if one
considers the community of users who need to *see* Professor Steensland's
work on line, the number whose efficient use of the information contained
within it would be constrained by an inability to search for (or otherwise
process) certain specific symbols is very small indeed.  Ergo, I suggest that
not everything that appears to be rendered as text needs to be processed
as text.
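
To make the split concrete, here is a toy sketch in C.  The structures are
invented purely for illustration (they are not any vendor's actual format):
rare symbols travel as opaque graphic references that are rendered on output
but never searched, while ordinary text remains fully processable.

    #include <stdio.h>
    #include <string.h>

    /* Invented document structure: a sequence of runs, where ordinary
     * text is processable and rare symbols are opaque graphic
     * references that are rendered but never searched. */
    enum run_kind { RUN_TEXT, RUN_GRAPHIC };

    struct run {
        enum run_kind kind;
        const char *data;    /* text, or an image name for RUN_GRAPHIC */
    };

    /* Search only the textual runs; graphic runs are skipped entirely. */
    int doc_contains(const struct run *runs, int n, const char *needle)
    {
        int i;
        for (i = 0; i < n; i++)
            if (runs[i].kind == RUN_TEXT && strstr(runs[i].data, needle))
                return 1;
        return 0;
    }

    int main(void)
    {
        struct run doc[] = {
            { RUN_TEXT,    "an attested dialect form, " },
            { RUN_GRAPHIC, "rare-symbol.img" },     /* rendered, not processed */
            { RUN_TEXT,    ", appears on this page" }
        };
        printf("%d\n", doc_contains(doc, 3, "dialect"));   /* prints 1 */
        return 0;
    }

The reader still *sees* the rare symbol; the search simply does not pretend
it is text.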

I emphatically am not suggesting that there be two character set standards
(one for rendering and one for processing).  I believe that the definition
of a character set should be influenced more heavily by processing needs
than by rendering needs, and that non-character (e.g. graphic or image)
representations of symbols required in the written culture of certain
linguistic, bibliographic, historical, or scientific works are a more useful
solution to real-world problems.

For processing character data from a large repertoire, I believe that the
10646 design (flawed though it may be, too) is superior to Unicode's.  I
have many reasons, and I have informally passed them on to the Unicode folks.
But my objections go to the very philosophy of Unicode, so I never expected any
reaction to them.  Unicode should be respected for what it is: a very thorough
synthesis of research and design efforts oriented towards solving the problem
of character rendering rather than that of character processing.  I just think
that's the wrong problem.

rich schwartz   (All views expressed are my own, and not Wang Labs, Inc.'s.)
 rschwartz@office.wang.com      VOICE (508) 967 5027     FAX (508) 967 0947
     Wang Labs, Inc., M/S 019-58A, 1 Industrial Ave., Lowell, MA 01851

kkim@plains.NoDak.edu (kyongsok kim) (05/01/91)

In article <b4pjoo.f0r@wang.com> rschwartz@OFFICE.WANG.COM (R. Schwartz@Wang R&D Net) writes:

:E.g., no program on any machine will ever truly be able to
:represent *all* integers.  To state that "adequacy of coverage... must be
:paramount" is reflective of only one position in a spectrum of user
:requirements for character sets.

As far as I know, #integers is NOT finite; however, #characters is FINITE
(although it may be VERY LARGE), isn't it?  I have a hard time figuring out
the parallelism here.  Please correct me if I am missing something.

k kim

rschwartz@OFFICE.WANG.COM (R. Schwartz@Wang R&D Net) (05/03/91)

I was a bit sloppy in the construction of my arguments in my original post.
Prof. Birnbaum has pointed out a number of weaknesses, some of which reflect
inaccuracies in my writing, and some of which indicate that I had not given
enough thought to the issues.  Kyongsok Kim has picked up on one of the
issues that reflects both failings.

I have yet to see Prof. Birnbaum's response posted, although I have
an e-mail copy.  When it appears, I will post some clarifications regarding
issues that he raised, and I look forward to hearing from others about
the basic socio-political difference between our respective positions.

kkim@plains.NoDak.edu (kyongsok kim) writes:

> As far as I know, #integers is NOT finite; however, #characters is FINITE
> (although it may be VERY LARGE), isn't it?  I have a hard time figuring out
> the parallelism here.  Please correct me if I am missing something.

My intention was only to illustrate that we do make abstractions that are
geared towards making implementations reasonable.  However, I might argue
that the analogy is fairly close: the set of characters is finite, but of
unknown extent and with no provable upper bound on its membership.  This
being the case, it is not possible to specify a maximum 'size' (either for
a bit encoding, a byte-sequence encoding, a Turing machine, or any other
mechanism) of a single character.

I don't have trouble believing that the set of characters has no provable
upper bound on membership.  I also argue that even if there is a provable
boundary today, it is not static.  I would contend that the size of the
set must be monotonically increasing, as new "written culture" (cf. DB's post)
is created and/or discovered but (thanks to Historians, Linguists,
Archaeologists, et al.) rarely irrevocably lost.  My argument on
this point is lacking somewhat in intellectual rigor [ that's what
eight years of implementation work can do to you :-) ], but I doubt
I can be proven incorrect.

I do not mean to imply that no abstraction can be devised to represent any
member of an infinite set.  If I recall correctly, X.409 / ASN.1 encoding
provides an elegant method for representing integers of arbitrary size.  My
intention was to illustrate that it is always possible to choose a member of
the set of size sufficient to overflow available space in an implementation
of any abstraction, either by exceeding fixed limits implied in the
abstraction itself, or by exceeding limits on resources.
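
For the curious, here is a rough C sketch of the *idea* behind such
encodings.  It is illustrative only, not a conforming ASN.1 implementation:
prefix the value's octets with a length, so that the size of the encoding
grows with the value instead of being fixed in advance.

    #include <stdio.h>

    /* Encode an unsigned value as a length octet followed by the
     * value's octets, most significant first.  Illustrative only;
     * real ASN.1 rules are more involved. */
    int encode_uint(unsigned long v, unsigned char *out)
    {
        unsigned char buf[sizeof v];
        int n = 0, i;

        do {                        /* collect octets, least significant first */
            buf[n++] = (unsigned char)(v & 0xFF);
            v >>= 8;
        } while (v != 0);

        out[0] = (unsigned char)n;  /* length prefix */
        for (i = 0; i < n; i++)     /* emit most significant octet first */
            out[1 + i] = buf[n - 1 - i];
        return 1 + n;               /* total encoded length */
    }

    int main(void)
    {
        unsigned char enc[1 + sizeof(unsigned long)];
        int len = encode_uint(70000UL, enc), i;

        for (i = 0; i < len; i++)
            printf("%02X ", (unsigned)enc[i]);    /* prints "03 01 11 70" */
        printf("\n");
        return 0;
    }

Even here, of course, the fixed C types bound what this particular program
can encode; a value too large for an unsigned long, or for the output buffer,
still overflows the implementation.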

For characters, this is as true as it is for integers.  In a fixed-width
bit encoding method, eventually one can always devise enough characters
to exhaust the available values.  In a completely open-ended variable-width
byte-sequence encoding, it is possible to construct a character that overflows
storage limits.  And in a scheme in which a limited sequence of members of one
subset of the fixed-width characters (i.e., diacritics) can be applied to a
single member of another subset (i.e., base symbols), the base subset and the
diacritic subset are both instances of fixed-width bit encodings; ergo, the
available values of both may eventually be exhausted.
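
The arithmetic is easy to make concrete.  The counts below are invented for
illustration; the point is only that the repertoire, however large, is finite:

    #include <stdio.h>

    /* With B base symbols, D diacritics, and at most K marks per base
     * character, the number of distinct composed forms is
     * B * (1 + D + D^2 + ... + D^K): large, but finite. */
    int main(void)
    {
        unsigned long B = 256, D = 64, K = 2;
        unsigned long forms = 0, combos = 1, k;

        for (k = 0; k <= K; k++) {
            forms += B * combos;   /* forms with exactly k diacritics */
            combos *= D;           /* D^(k+1) for the next round      */
        }
        printf("%lu distinct composed forms\n", forms);   /* 1065216 */
        return 0;
    }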

So, yes, the analogy as I stated it initially is weak.  Nevertheless, I
believe that if we accept that the set of characters is finite but of
unknown extent, the analogy does hold.  And the purpose that I had in
mind was to demonstrate that the pragmatic concerns of implementors are not
only legitimate... they are inevitable.

rich schwartz   (All views expressed are my own, and not Wang Labs, Inc.'s.)
 rschwartz@office.wang.com      VOICE (508) 967 5027     FAX (508) 967 0947
     Wang Labs, Inc., M/S 019-58A, 1 Industrial Ave., Lowell, MA 01851