[comp.std.internat] Character set adequacy and criteria for coverage

djbpitt@unix.cis.pitt.edu (David J Birnbaum) (05/01/91)

    Rich Schwartz's thoughtful and closely argued response to my recent
posting suggests a need to define more precisely some areas where we dis-
agree and to clarify some of the consequences of these disagreements.  In
the following discussion I use the term 'character' to mean 'machine unit
of representation' and 'grapheme' to mean 'human unit of representation'.
Because the issues are complex and I do not wish to misrepresent anything,
I have included fairly lengthy citations from RS's posting and I apologize
in advance for the length of what follows.

    The essence of my posting was:

DJB> ... we first decide what must be represented and only then can we
DJB> evaluate the ramifications of one or another system of representation.
DJB> And what must be represented is written culture, not just vendors'
DJB> databases of clients.

    RS responds with three points, which I will address in order:

1)  No adequate solution is possible and anything is a compromise; the is-
    sue should be defined not as adequate vs inadequate but as where the
    compromise should be made.

2)  What _must_ be included should be defined by the majority of users;
    special high-end users may be accommodated if doing so would not be too
    expensive, but their unusual requirements should not dictate policy.

3)  Glyph inventories (rendering) may have to be larger than character in-
    ventories (processing).

    The discussion below suggests the following:

1)  RS's arguments concerning #1 do not support his claim.

2)  RS's arguments in support of #2 are cogent, but our differences on this
    issue reflect two equally defensible social and political perspectives
    on internationalization, which I attempt to characterize in more gener-
    al terms.  I also raise an additional technical and historical argument
    in support of accommodating difficult (nonlinear) orthographies.

3)  RS's evaluation of #3 is insufficient; I agree that character and glyph
    inventories need not be isomorphic, but in certain instances the inven-
    tory for processing may have to be larger than the inventory for
    rendering.

1.  The possibility of adequacy
    ===========================

RS> A character set, like any data representation, is a combination of
RS> codes and processing rules that together comprise an abstraction of
RS> properties of real-world entities.  Every abstraction, no matter how
RS> well constructed, loses some information.  E.g., no program on any ma-
RS> chine will ever truly be able to represent *all* integers.

    I agree fully with RS's assumptions 1) that character sets are not
real-world data but abstractions that lose some information present in
realia, and 2) that the inventory of all integers is not representable.
But I do not believe that either of these considerations is relevant to the
problem of designing a character set or that RS's conclusions follow from
his assumptions.

    Orthography, or the inventory and disposition of graphemes used for
specific writing systems, is also an abstraction from language that neces-
sarily loses certain information.  Even if we agree, for the sake of argu-
ment, that all abstractions away from realia necessarily lose some informa-
tion, it is unclear why the recoding of one such abstraction (graphemes) as
another (characters) is inevitably subject to the same inexorable decay.
As a concrete example, transliteration systems that convert what is an ab-
stract representation of language data in one script to an abstract repre-
sentation in another script may be perfectly equivalent to one another with
no (additional) loss of information, as long as the correspondences are
biunique.
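
    The point about biunique correspondences can be illustrated with a
minimal sketch.  The five-letter table below is hypothetical (it is not
any published transliteration standard), each target is assumed to be a
single letter so that the reverse mapping can work letter by letter, and
the function names are invented for the illustration.

```python
# Hypothetical five-letter table (Cyrillic to Latin stand-ins); it is an
# illustration only, not any published transliteration standard.
TABLE = {"а": "a", "б": "b", "в": "v", "г": "g", "д": "d"}


def is_biunique(table):
    """True if no two source graphemes share a target (one-to-one map)."""
    return len(set(table.values())) == len(table)


def transliterate(text, table):
    """Recode text one grapheme at a time using the given table."""
    return "".join(table[ch] for ch in text)


def invert(table):
    """Build the reverse table; refuse if the map is not biunique."""
    if not is_biunique(table):
        raise ValueError("correspondences are not biunique")
    return {target: source for source, target in table.items()}


original = "вада"
latin = transliterate(original, TABLE)
restored = transliterate(latin, invert(TABLE))
round_trip_ok = restored == original  # recoding lost nothing

# A lossy variant: two graphemes collapse onto one target, so the
# reverse table cannot be built and invert() raises ValueError.
LOSSY = {**TABLE, "д": "v"}
```

    With the biunique table the round trip restores the original string;
with the lossy table the reverse mapping is refused, which is the only
case in which the recoding itself discards information.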

    One further detail is that the inventory of real-world integers is in-
finite and it is this property that makes this inventory unrepresentable.
RS has not demonstrated that the inventory of graphemes in human writing is
infinite and has therefore not demonstrated that its representation is sub-
ject to the same constraints as a representation of integers.  There is, to
be sure, an infinite number of things that people may write, but there is
no agreement that these are graphemes in the same way that there is univer-
sal agreement about what an integer is.  I do not want to pretend that the
set of all writable entities is fixed, only that the set of regular ortho-
graphic units is not infinite.  If we can have an alphabet, we can recode
this alphabet into machine characters.

    Thus, from the generally accepted assumptions that abstractions from
realia lose information and that the inventory of all integers cannot be
represented, RS attempts to draw a conclusion that is unrelated to these
assumptions: that isomorphic mapping between abstractions that constitute a
closed (although very large) set is impossible.

RS> Show me a machine that represents "all" integers, and I'll show it an
RS> integer that it can't represent.  We therefore compromise in our ab-
RS> straction, and settle for a representation that considers adequacy of
RS> coverage for the *vast majority* of the foreseen users to be paramount.

    RS suggests that no adequate (in my sense of the term) solution is at-
tainable, that compromises must be made, and that the only question is
where to compromise.  Since RS has not proven his suggestion that an ade-
quate solution is unattainable, I would like to recast the issue as "even
if an adequate solution is attainable, how do we decide whether it is worth
the expense?"

2.  Criteria for determining what to include
    ========================================

RS> ... critical design decisions that affect the methods by
RS> which *all* characters in a standard are represented should
RS> not be made based solely (or paramountly ... ) on the
RS> "high-end" requirements.  I believe that the processing needs of
RS> the 99.99+% of users whose requirements fall short of this
RS> high-end of the spectrum should play a larger role in the de-
RS> sign of a new character set.

    I agree with RS that one must draw a line somewhere, but perhaps it
would be easier to decide where to draw the line if we were able to articu-
late a basis for our decision.  In the absence of any specific data, I as-
sume that the figure of 99.99+% is a rhetorical exaggeration.  I do not
dispute that there is a high end, but where does it begin?  How about
99.98%?  95%?  90%?  And how do we gather these figures?

    I assume that RS and I agree that the only candidates for inclusion are
those entities required by a recognizable and culturally significant con-
stituency.  This means that if I would like 10646 to include a character
that represents my pet beagle, Fred, I would be politely (or not so polite-
ly) shown to the private use zone, since there is no recognizable con-
stituency for which Fred is a grapheme.  But if the Israeli national stan-
dards body regards vowel points and cantillation marks as independent
graphemes whose representation as independent characters they consider im-
portant for processing purposes, the ISO must consider seriously the needs
of what clearly represents an identifiable and culturally significant con-
stituency (in the sense that minority cultures are no less important than
majority ones).  This does not mean that Hebrew (with separately coded
vowel points) is automatically admitted; RS might acknowledge that Hebrew
has a more legitimate claim than Fred and that the ISO should consider it
seriously, but he might conclude that it is nevertheless too "high-end" to
justify the expense.

RS> For some I'm sure this is sad but true: someone is going to
RS> pay for the development and implementation of any new charac-
RS> ter set by any vendor.

    If we discover a single nonlinear writing system that we agree must be
included, we must build in the mechanism to represent it.  Thus, if we de-
cide to include pointed Hebrew and treat vowel point graphemes as charac-
ters, 10646 will have to contain a mechanism for dealing with characters
that do not occupy linear space.  And if that mechanism must be developed
(at no small expense) for this portion of the character set, it is then
relatively inexpensive to apply it elsewhere in the same set (say, to ac-
cented early Cyrillic).

    RS might argue that supporting any nonlinear orthography, including
pointed Hebrew, will not pay for itself, which may be true and which raises
the question of whether "pay for itself" is the appropriate criterion.  One
purpose of international cooperative organizations is to balance the legit-
imate needs of both wealthier or more powerful larger bodies, who pay the
bills, and small constituencies, who must protect themselves and their cul-
tures from being overwhelmed by their more visible neighbors.

    Without wishing to overdo the analogy, the organization of the United
Nations into a General Assembly and a Security Council provides examples of
both solutions.  RS seems to be arguing for a Security Council approach to
character sets, dominated by the rich and powerful, while I am arguing for
a General Assembly approach, where minor writing systems have the same one
vote as major ones.  Each approach has its merits; the former recognizes
that it is the rich and powerful who will pay the costs of development and
who may therefore have a legitimate claim that they should be the ones to
set the agenda, while the latter recognizes that less influential cultures
may have legitimate interests no less significant than those of the rich
and powerful.

    Although both positions in this social and political issue find support
in the history of international cooperation, I believe there is a technical
and historical argument in favor of supporting small but legitimate con-
stituencies in the area of coded character sets.  The history of represent-
ing orthography in coded character sets has shown an inexorable expansion
to include more and more writing systems.  Whether it is convenient for
programmers and programming languages or not, cultures whose writing sys-
tems are not based on a system of purely linear writing _will_ continue to
enter the computer age.  At some point there will be enough of a market in
at least one of these countries that their writing system _will_ be
represented in coded character sets and this set _will_ be standardized.
We can either face this future now and design a character set architecture
that reflects the fact that languages with nonlinear writing will eventual-
ly be coded and standardized or we can cling to the past and pretend that
the current architecture, based on linear writing, will continue to serve
us.  If the (not insignificant) price will have to be paid eventually,
there may be some merit in paying it now, rather than paying a small price
now _and_ a larger price later.  Perhaps all designs are eventually
destined for obsolescence and it is impossible to predict the future, but
in those areas where we can see into the future we should not avert our
eyes.

3.  Processing (characters) vs rendering (glyphs)
    =============================================

RS> I believe that the way to compromise on this question is to
RS> recognize two very distinctly different classes of need: one
RS> is for *rendering* an extremely wide variety of written sym-
RS> bols, and the other is for *processing* a less extensive set.
RS> ... I suggest that not everything that appears to be rendered
RS> as text needs to be processed as text.

    I agree completely that rendering and processing needs may differ, but
I am afraid that to treat the inventory of entities to be processed as a
subset of the inventory to be rendered is oversimplified and inappropriate.
In my own experience of both processing and rendering orthographically com-
plex writing, I find the reverse is often true; graphemes that share cer-
tain physical features may be rendered similarly but must be discrete
characters if they are to be distinguished in processing.  For example, at
a certain period in middle Russian writing, v and d had an identical shape;
these would have to be entered as separate characters if any processing ef-
fort is to distinguish them, but they might be mapped to a single glyph for
rendering.  While character and glyph inventories are not isomorphic, nei-
ther is automatically a subset of the other.
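
    The v/d case can be sketched as a many-to-one mapping from characters
to glyphs.  The character and glyph names below are hypothetical labels
chosen for the illustration; they do not come from any real character set
or font.

```python
# Hypothetical character-to-glyph table: two distinct characters (VE, DE)
# share one glyph, as in the middle Russian example; A has its own glyph.
GLYPH_FOR = {"VE": "glyph_1", "DE": "glyph_1", "A": "glyph_2"}


def render(chars):
    """Map a sequence of characters to the glyphs that would be drawn."""
    return [GLYPH_FOR[c] for c in chars]


word_with_v = ["VE", "A"]
word_with_d = ["DE", "A"]

# On the page the two words look the same ...
same_on_page = render(word_with_v) == render(word_with_d)

# ... but processing (searching, sorting, indexing) can still tell them apart.
distinct_in_processing = word_with_v != word_with_d

# Here the processing inventory is larger than the rendering inventory.
char_inventory = set(GLYPH_FOR)
glyph_inventory = set(GLYPH_FOR.values())
```

    In this toy table the character inventory has three members and the
glyph inventory two, so the inventory for processing is the larger one.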

--David
=====================================================================
Professor David J. Birnbaum       djbpitt@vms.cis.pitt.edu [Internet]
The Royal York Apartments, #802   djbpitt@pittvms [Bitnet]
3955 Bigelow Boulevard            voice: 1-412-687-4653
Pittsburgh, PA  15213             fax:   1-412-624-9714

eliot@chutney.rtp.dg.com (Topher Eliot) (05/02/91)

In article <122779@unix.cis.pitt.edu>, djbpitt@unix.cis.pitt.edu (David J Birnbaum) writes:
...
|>     I also raise an additional technical and historical argument
|>     in support of accommodating difficult (nonlinear) orthographies.

What exactly does "linear" mean in this context?

-- 
Topher Eliot                           Data General DG/UX Internationalization
(919) 248-6371        62 T. W. Alexander Dr., Research Triangle Park, NC 27709
eliot@dg-rtp.dg.com                           {backbone}!mcnc!rti!dg-rtp!eliot
Obviously, I speak for myself, not for DG.

djbpitt@unix.cis.pitt.edu (David J Birnbaum) (05/03/91)

I wrote:

>I also raise an additional technical and historical argument in support
>of accommodating difficult (nonlinear) orthographies.

In response to which:

In article <1991May2.114119.2457@dg-rtp.dg.com> eliot@dg-rtp.dg.com writes:

>What exactly does "linear" mean in this context?

In some writing systems, such as English, graphemes are written in a
linear fashion: each one follows its predecessor in a straight line.
In a writing system like Hebrew, on the other hand, some graphemes
are written above their predecessors, some below, and some to the
left.  For English-like writing, a grapheme pretty much corresponds to
a horizontal position.  For Hebrew-like writing, it doesn't.
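
The difference can be sketched with one possible encoding.  The example
below assumes Unicode code points and Python's standard unicodedata
module, neither of which is part of the discussion above; it counts as a
"horizontal position" any character whose combining class is zero, which
is a rough approximation rather than a layout algorithm.

```python
import unicodedata

# A pointed Hebrew word, written with escapes: shin, shin dot, qamats,
# lamed, vav, holam, final mem (assumed here to be Unicode code points).
HEBREW_WORD = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"
ENGLISH_WORD = "shalom"


def horizontal_positions(text):
    """Count characters that occupy their own place along the line.

    Characters with a nonzero combining class (such as Hebrew points)
    attach to a preceding base letter instead of advancing the line.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


english_chars = len(ENGLISH_WORD)
english_positions = horizontal_positions(ENGLISH_WORD)

hebrew_chars = len(HEBREW_WORD)
hebrew_positions = horizontal_positions(HEBREW_WORD)
```

For the English word the two counts agree (six characters, six
positions); for the pointed Hebrew word they do not (seven characters,
four positions), which is what "nonlinear" is meant to capture.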

--David
=======================================================================
Professor David J. Birnbaum         djbpitt@vms.cis.pitt.edu [Internet]
The Royal York Apartments, #802     djbpitt@pittvms.bitnet   [Bitnet]
3955 Bigelow Boulevard              voice: 1-412-687-4653
Pittsburgh, PA  15213  USA          fax:   1-412-624-9714