rschwartz@OFFICE.WANG.COM (R. Schwartz@Wang R&D Net) (05/01/91)
in <message id fiendishly hidden by brain-damaged software... sorry> djbpitt@vms.cis.pitt.edu (Professor David J. Birnbaum) writes:

> I would also be interested in seeing some discussion of the issues
> raised in the posting to which I am responding. But it is misleading
> to caricature the question of adequacy of coverage as quibbling over
> the presence or absence of individual characters. The real issue is
> whether a multilingual character set should be able to support texts
> such as Professor Steensland's monograph.
>
> This issue is not a squabble about a single missing character with
> a funny name. It is about how the issue of separable diacritics
> affects not only programming concerns; this issue affects in a
> substantial way the adequacy of coverage of the character set. And
> the latter must be paramount; we first decide what must be represented
> and only then can we evaluate the ramifications of one or another
> system of representation. And what must be represented is written
> culture, not just vendors' databases of clients.

These arguments represent several viewpoints with which I must respectfully disagree.

A character set, like any data representation, is a combination of codes and processing rules that together constitute an abstraction of properties of real-world entities. Every abstraction, no matter how well constructed, loses some information. For example, no program on any machine will ever truly be able to represent *all* integers; show me a machine that represents "all" integers, and I'll show it an integer that it can't represent. To state that "adequacy of coverage... must be paramount" reflects only one position in a spectrum of user requirements for character sets. We therefore compromise in our abstraction, and settle for a representation that considers adequacy of coverage for the *vast majority* of the foreseen users to be paramount.
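The integer analogy can be made concrete: a fixed-width abstraction covers the vast majority of values and simply cannot hold the rest. A minimal sketch (in modern Python, which of course postdates this thread; `struct`'s 32-bit signed format stands in here for any fixed-size machine integer):

```python
import struct

# 2**31 - 1 is the largest value a 32-bit signed field can hold.
ok = struct.pack(">i", 2**31 - 1)   # fits: the abstraction covers it

# One past the limit is simply unrepresentable in this abstraction.
try:
    struct.pack(">i", 2**31)
    overflowed = False
except struct.error:
    overflowed = True               # the abstraction loses this integer

print(overflowed)                   # True
```

Any fixed-size representation admits such a value; widening the field only moves the boundary, it does not remove it.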
I don't believe that, worthy as such efforts are, works like Professor Steensland's are the ones that should govern the design of a character set. Please note that I did not say a character set *should not* provide coverage for such a work. I merely say that critical design decisions that affect the methods by which *all* characters in a standard are represented should not be made based solely (or paramountly -- I know there are linguists reading this group, so if that isn't a word... mea culpa) on the "high-end" requirements. I believe that the processing needs of the 99.99+% of users whose requirements fall short of this high end of the spectrum should play a larger role in the design of a new character set.

For some I'm sure this is sad but true: someone is going to pay for the development and implementation of any new character set by any vendor. So please don't blame anyone for putting the written culture of the client base ahead of the more challenging written culture of a few additional "high-end" customers.

I speak from experience on this matter: my company was a pioneer in bringing text processing to the mass market, with solid customer bases in business, government, academics, etc. in the US and abroad; yet for years my advocacy of larger character repertoires for our customers went largely unheard, because the work involved was too large and the number of high-end customers willing to pay for the effort was too small. Colleagues with other vendors have told me that the same is true in their situations. (Please... don't insult me with "and look where your company is today..." Our character set coverage issues pale in comparison to the things that have put us where we are now.)

I believe that the way to compromise on this question is to recognize two very distinctly different classes of need: one is for *rendering* an extremely wide variety of written symbols, and the other is for *processing* a less extensive set.
Rendering Professor Steensland's work is a substantially different problem from processing it. I would argue that even among the community of users who need to *see* Professor Steensland's work on line, the number whose efficient use of the information contained within it would be constrained by an inability to search for (or otherwise process) certain specific symbols is very small indeed. Ergo, I suggest that not everything that appears to be rendered as text needs to be processed as text.

I emphatically am not suggesting that there be two character set standards (one for rendering and one for processing). I believe that the definition of a character set should be influenced more heavily by processing needs than by rendering needs, and that non-character (e.g. graphic or image) representations of symbols required in the written culture of certain linguistic, bibliographic, historical, or scientific works are a more useful solution to real-world problems.

For processing character data from a large repertoire, I believe that the 10646 design (flawed though it may be, too) is superior to Unicode's. I have many reasons, and I have informally passed them on to the Unicode folks. But my objections go to the very philosophy of Unicode, so I never expected any reaction to them. Unicode should be respected for what it is: a very thorough synthesis of research and design efforts oriented towards solving the problem of character rendering rather than that of character processing. I just think that's the wrong problem.

rich schwartz
(All views expressed are my own, and not Wang Labs, Inc.'s.)
rschwartz@office.wang.com    VOICE (508) 967 5027    FAX (508) 967 0947
Wang Labs, Inc., M/S 019-58A, 1 Industrial Ave., Lowell, MA 01851
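The processing-versus-rendering distinction can be illustrated with the separable-diacritics question itself. The sketch below uses modern Python and Unicode normalization -- machinery that postdates this discussion, offered purely as illustration: two encodings of the same rendered symbol compare unequal until a processing rule reconciles them.

```python
import unicodedata

precomposed = "\u00e9"    # e-acute coded as one character
decomposed = "e\u0301"    # base 'e' plus a separable (combining) acute

# Both render as the same symbol, but a naive comparison --
# the kind that searching and sorting depend on -- sees two strings.
print(precomposed == decomposed)                 # False

# A processing rule (canonical composition) makes them equal.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed)                        # True
```

This is exactly the sense in which separable diacritics are a processing concern, not merely a rendering one.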
kkim@plains.NoDak.edu (kyongsok kim) (05/01/91)
In article <b4pjoo.f0r@wang.com> rschwartz@OFFICE.WANG.COM (R. Schwartz@Wang R&D Net) writes:
:E.g., no program on any machine will ever truly be able to
:represent *all* integers. To state that "adequacy of coverage... must be
:paramount" is reflective of only one position in a spectrum of user
:requirements for character sets.
As far as I know, #integers is NOT finite; however, #characters is FINITE
(although it may be VERY LARGE), isn't it? I have a hard time figuring out
the parallelism here. Please correct me if I am missing something.
k kim
rschwartz@OFFICE.WANG.COM (R. Schwartz@Wang R&D Net) (05/03/91)
I was a bit sloppy in the construction of my arguments in my original post. Prof. Birnbaum has pointed out a number of weaknesses, some of which reflect inaccuracies in my writing, and some of which indicate that I had not given enough thought to the issues. Kyongsok Kim has picked up on one of the issues that reflects both failings. I have yet to see Prof. Birnbaum's response posted, although I have an e-mail copy. When it appears, I will post some clarifications regarding issues that he raised, and I look forward to hearing from others about the basic socio-political difference between our respective positions.

kkim@plains.NoDak.edu (kyongsok kim) writes:

> As far as I know, #integers is NOT finite; however, #characters is FINITE
> (although it may be VERY LARGE), isn't it? I have a hard time figuring out
> the parallelism here. Please correct me if I am missing something.

My intention was only to illustrate that we do make abstractions that are geared towards making implementations reasonable. However, I might argue that the analogy is fairly close: the set of characters is finite, but of unknown extent and with no provable upper bound on its membership. This being the case, it is not possible to specify a maximum 'size' (either for a bit encoding, a byte-sequence encoding, a Turing machine, or any other mechanism) of a single character.

I don't have trouble believing that the set of characters has no provable upper bound on membership. I also argue that even if there is a provable boundary today, it is not static. I would contend that the size of the set must be monotonically increasing, as new "written culture" (cf. DB's post) is created and/or discovered but (thanks to Historians, Linguists, Archaeologists, et al.) not often irrevocably lost. My argument on this point is lacking somewhat in intellectual rigor [ that's what eight years of implementation work can do to you :-) ], but I doubt I can be proven incorrect.
I do not mean to imply that no abstraction can be devised to represent any member of an infinite set. If I recall correctly, X.409 / ASN.1 encoding provides an elegant method for representing integers of arbitrary size. My intention was to illustrate that it is always possible to choose a member of the set large enough to overflow the available space in an implementation of any abstraction, either by exceeding fixed limits implied in the abstraction itself, or by exceeding limits on resources.

For characters, this is as true as it is for integers. In a fixed-width bit encoding, one can eventually devise enough characters to exhaust the available values. In a completely open-ended variable-width byte-sequence encoding, it is possible to construct a character that overflows storage limits. In a scheme in which a limited sequence of members of one subset of the fixed-width characters (i.e., diacritics) can be applied to a single member of a different subset of the fixed-width characters (i.e., base symbols), the base subset and the diacritic subset are both instances of fixed-width bit encodings, ergo the available values of both may eventually be exhausted.

So, yes, the analogy as I stated it initially is weak. Nevertheless, I believe that if we accept that the set of characters is finite but of unknown extent, the analogy does hold. And the purpose that I had in mind was to demonstrate that the pragmatic concerns of implementors are not only legitimate... they are inevitable.

rich schwartz
(All views expressed are my own, and not Wang Labs, Inc.'s.)
rschwartz@office.wang.com    VOICE (508) 967 5027    FAX (508) 967 0947
Wang Labs, Inc., M/S 019-58A, 1 Industrial Ave., Lowell, MA 01851
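The open-ended encoding mentioned above can be sketched in miniature. The following is a base-128 "continuation bit" scheme in the general spirit of such variable-width encodings -- not the literal ASN.1 octet rules -- written in modern Python for illustration. Any non-negative integer round-trips; only real storage limits constrain it.

```python
def encode_varint(n):
    """Base-128 encoding: 7 value bits per byte; the high bit of each
    byte means 'more bytes follow'. No fixed width, so no fixed limit."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)   # continuation bit set: more to come
        else:
            out.append(b)          # final byte: high bit clear
            return bytes(out)

def decode_varint(data):
    """Reassemble the 7-bit groups, least significant group first."""
    n, shift = 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):
            break
    return n

# An arbitrarily large value round-trips; a fixed 32- or 64-bit field
# could not hold it, but this open-ended encoding can.
big = 2**200
assert decode_varint(encode_varint(big)) == big
```

Note that this only relocates the limit, as the text argues: the encoding itself is unbounded, but any concrete implementation still runs out of storage for a sufficiently large member of the set.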