pom@under..ARPA (Peter O. Mikes) (08/28/87)
Subject: generalised alphabets (letters + diacritics)

One conclusion which follows from the last month of discussions on accented
languages is that we should be happy that we have ASCII and ISCII, and that
at least one of them has a lasting place in both past and future. The other
conclusion, in accordance with MEGATRENDS, is that we do have and will have
a need for a more general method of describing accented letters (since the
French are not going to give up any cheese). So, taking some guidance from
mathematics [which can eat three alphabets for breakfast and still have
Alef for lunch], how would you people feel about the following method FOR
CODING complex alphabets?

1) ASCII or ISCII is and will remain the plain-vanilla work-horse.

2) As time passes, more and more groups (national or application) will ask
for extensions. BTW: I recall that DP - in the Olden Times - used capitals
only (so did the first typewriters); now, as the pendulum swings back, we
have the unix excess of the lower-case-good / capital-case-bad era.

3) To accommodate such (eventual future) demand, in a manner which will not
degenerate into an orgy of escapes within escapes, consider the following.
Let's say that we define three sets of signs: super-scripts, sub-scripts
and (normal level) base-scripts. We can then create, e.g., a French set by
combining the Latin base with the ` ^ o ... set of super-scripts, a German
set by combining the Latin base with .. (umlaut), a Swedish set with a tiny
o superscript, a base-level overstrike /, etc. We can have some
mathematical symbols as letters with - + * etc. (either above or next to
letters), and just about every other national alphabet. It is clear that
any national group will need and will use equipment (e.g. printers) which
will produce whatever ornamentations their forefathers dreamed up.

By creating a standard - which will define any such set in an agreed-upon
manner - we can still maintain some order in the madness: a) we do not
force anybody to give up their favorite excess, and b) we do not force (by
default) each national task force to dream up their own unique and
(naturally incompatible) methods of achieving their goal.

Any comments?   pom@under.s1.gov || @s1-under.UUCP
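[Editor's sketch: the base-script plus super-script pairing proposed above
can be modeled as follows. The mark characters, the national mark sets, and
the validation rule are all illustrative assumptions, not part of any
standard.]

```python
# Hypothetical composition scheme: a generalized character is a
# (base, mark) tuple, where mark is a super-script diacritic or
# None for a plain base-script letter.

FRENCH_MARKS = {"`", "^", "'"}   # grave, circumflex, acute (illustrative)
GERMAN_MARKS = {'"'}             # umlaut, approximated as '"'

def compose(base, mark=None, allowed=FRENCH_MARKS):
    """Build one generalized character; reject marks outside the national set."""
    if mark is not None and mark not in allowed:
        raise ValueError("mark %r not in this national set" % mark)
    return (base, mark)

# e with grave accent, in the hypothetical French set:
e_grave = compose("e", "`")
```

A German or Swedish set is then just the same Latin base with a different
`allowed` set of marks, which is the point of the proposal: one combining
rule, many national alphabets.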
cik@l.cc.purdue.edu (Herman Rubin) (08/29/87)
In article <15488@mordor.s1.gov>, pom@under..ARPA (Peter O. Mikes) writes:
> Subject: generalised alphabets (letters + diacritics)
> One conclusion which follows from the last month of discussions
> on accented languages is, that we should be happy that we have ASCII and ISCII
> and that at least one of them has a lasting place in both past and future.

Why?

> The other, being in accordance with MEGATRENDS, is that we do have
> and will have also a need for more general method (since French are not going
> to give up any cheese) of describing accented letters.

What we need is a very flexible method which does not _require_ (as the
roffs and TeX do) that the user hit many keys for one result, and which
also displays whatever the user wants displayed when the user types it. I
do not care if _you_ define a typed character to be e with a grave accent,
or if _you_ require that on _your_ terminal the accent be typed first and
then the e; but whatever method _I_ use should produce the character on my
screen, and I should be able to customize it to my requirements. At
present, this is very difficult.

For any particular language, it may be advisable to have a standard, which
should be published, so that different people can produce the appropriate
front- and back-ends for communication purposes. However, this should not
be a standard for terminals. That ASCII is used for communication is no
good reason why a terminal should use it.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet
alan@pdn.UUCP (Alan Lovejoy) (08/31/87)
A Proposal:

What is needed is to make a distinction between logical and physical
characters, and to distinguish between speech sounds and particular
audiographs (which can be further de-generalized into specific fonts).

If every letter of every human alphabet, and every ideograph, were given a
unique 32-bit (for example) id number, it would be possible to create a
'character look-up table' whose indices were 8-bit numbers (for example).
ASCII would then become a particular "character palette": a mapping from
8-bit "logical" characters to 32-bit "physical" characters. If this sounds
like color graphics/pixels/color-lookup-tables, that's because the idea
was motivated thereby.

Also, the possible human speech sounds should be given unique id numbers
(is 32 bits sufficient?), and a speech-sound-to-physical-character
translation table would be used to represent each speech sound with any
desired character or sequence of characters.

The character translation table and the speech sound translation table
would become standard features of all operating systems, and could be
defined at the beginning of each text file. The character translation
table would translate an 8-bit index either directly into a physical
character, or else into a speech sound which could then be translated into
(a) physical character(s). This option would be decided independently for
each 8-bit character index. Text files that begin normally would be
considered ASCII files, and their 8-bit codes would use the ASCII palette.
With the proper escape sequences, palettes could be changed at any time,
either by selecting a predefined palette or by defining a new one.

Can anyone suggest improvements on this? What do you think?

--Alan@pdn
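[Editor's sketch: the "character palette" idea above, with 8-bit logical
codes indexing a table of 32-bit physical ids. The physical id 0x00010065
assigned below is invented for illustration; only the identity mapping for
codes 0-127 corresponds to real ASCII.]

```python
# A palette maps each of 256 logical codes to a 32-bit physical id,
# exactly like a color look-up table maps pixel values to colors.

def ascii_palette():
    """Default palette: logical code n maps to physical id n (plain ASCII)."""
    return list(range(256))

def remap(palette, logical_code, physical_id):
    """Return a new palette with a different physical character at one slot."""
    if not (0 <= physical_id < 2**32):
        raise ValueError("physical ids are 32-bit")
    palette = list(palette)
    palette[logical_code] = physical_id
    return palette

# Install a hypothetical id for 'e with acute' at logical slot 0xE9:
pal = remap(ascii_palette(), 0xE9, 0x00010065)
```

An escape sequence in a text file would then simply name (or carry) a
replacement palette, leaving the 8-bit byte stream itself unchanged.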
marty1@houdi.UUCP (M.BRILLIANT) (09/01/87)
In article <1209@pdn.UUCP>, alan@pdn.UUCP (Alan Lovejoy) writes:
> A Proposal:
> 
> If every letter for any human alphabet, and every ideograph, were given
> a unique 32-bit (for example) id number, it would be possible to create
> a 'character look-up table' ...
> 
> Also, the possible human speech sounds should be given unique id
> numbers (is 32 bits sufficient?), ....

A 32-bit number could code any possible character that can be described in
a square 16 pixels on a side. I think some Chinese ideographs are more
complex than that. With run-length coding, the power of a 32-bit number is
greater, and it might be adequate.

I'm even less certain about the representation of speech sounds. The
possible positions of the organs of articulation are continuously
variable. Human speech can be as fast as about 10 distinct sounds per
second. Coding each sound in 32 bits implies that a vocoder could encode
human speech at 320 bits per second. Is that possible?

M. B. Brilliant					Marty
AT&T-BL HO 3D-520	(201)-949-1858
Holmdel, NJ 07733	ihnp4!houdi!marty1
alan@pdn.UUCP (09/01/87)
In article <1296@houdi.UUCP> marty1@houdi.UUCP (M.BRILLIANT) writes:
>A 32-bit number could code any possible character that can be described
>in a square 16 pixels on a side. I think some Chinese ideographs are
>more complex than that. With run-length coding, the power of a 32-bit
>number is greater, and it might be adequate.

An error and a misunderstanding:

1) The number of possible pictures in an m by n matrix of pixels, where
each pixel can be one of c colors, is c^(m*n). For a 16 x 16 matrix of
black-or-white pixels there are 2^256 possibilities, so a 256-bit number
is required.

2) This is NOT what I had in mind at all. I wanted a simple index to be
assigned to each EXISTING character or ideogram, which index would be an
abstraction independent of any particular graphical representation or
font.

>I'm even less certain about the representation of speech sounds. The
>possible positions of the organs of articulation are continuously
>variable. Human speech can be as fast as about 10 distinct sounds per
>second. Coding each sound in 32 bits implies that a vocoder could
>encode human speech at 320 bits per second. Is that possible?

There is something called the International Phonetic Alphabet, which uses
various modifiers to enable the transcription of any human speech sound
with fewer than 256 symbols. This would probably be adequate, although the
finer the resolution (the more characters representing different phones),
the less complicated the system of modifiers needs to be.

pdn
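[Editor's check of the counting argument in point 1, in Python: an m-by-n
matrix of c-valued pixels has c^(m*n) distinct pictures, so a binary 16x16
bitmap needs 256 bits, not 32.]

```python
import math

def bits_needed(m, n, c=2):
    """Bits required to enumerate every m x n picture with c pixel colors."""
    pictures = c ** (m * n)          # c^(m*n) distinct pictures
    return math.ceil(math.log2(pictures))

# A 16x16 black-and-white bitmap: 2^256 pictures, hence 256 bits.
# → bits_needed(16, 16) == 256
```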
marty1@houdi.UUCP (M.BRILLIANT) (09/02/87)
In article <1222@pdn.UUCP>, alan@pdn.UUCP (Alan Lovejoy) writes:
> In article <1296@houdi.UUCP> marty1@houdi.UUCP (M.BRILLIANT) writes:
> >A 32-bit number could code any possible character that can be described
> >in a square 16 pixels on a side...
> 
> An error ....
> 
> (1) ..... For a 16 x 16 matrix
> of black or white pixels, there are 2^256 possibilities and a 256-bit
> number is required.

Sorry, that was careless of me. A 32-bit number will encode every 5x5
character, but only a sixteenth of the possible 6x6 ones; run-length
coding buys a little more. Not very impressive.

> ... and a misunderstanding:
> 2) This is NOT what I had in mind at all. I wanted a simple index to
> be assigned to each EXISTING character or ideogram, which index would
> be an abstraction independent of any particular graphical representation
> or font.

Then your scheme might have to change if another language is to be added,
or if some language creates another ideograph. There are thousands of
languages in the world, with I don't know how many different alphabets. I
wanted to avoid limiting the scope to existing alphabets, and to prove
that all possible alphabets could be included.

Actually, I gather that Chinese ideographs follow certain conventions
inscrutable to the alphabetic mind, and that there's a typesetting system
for them. That would make it easier to represent them in a universal
alphabet, even if they need more than one 32-bit number each. I hope other
languages are as easy.

> >I'm even less certain about the representation of speech sounds...
> >..... Coding each sound in 32 bits implies that a vocoder could
> >encode human speech at 320 bits per second. Is that possible?
> 
> There is something called the International Phonetic Alphabet, which
> uses various modifiers to enable the transcription of any human
> speech sound with less than 256 symbols....

I'm not certain it's literally true that the IPA can describe any human
speech sound, and not just the sounds in a particular set of known
languages. For instance, could it describe the tone system of Chinese? If
it can, you have a feasible proposal.

M. B. Brilliant					Marty
AT&T-BL HO 3D-520	(201)-949-1858
Holmdel, NJ 07733	ihnp4!houdi!marty1
lee@uhccux.UUCP (Greg Lee) (09/02/87)
32 bits sounds about right to me for characterizing all the sounds of
human languages. This is somewhat greater than the number of features
proposed in The Sound Pattern of English (Chomsky and Halle, 1968). The
feature system proposed there is a standard in linguistics. Since 1968
many modifications to this system have been proposed, but none that would
change the total number of features greatly. The features have to do with
positions and movements of the organs of articulation.

Such a system of representation would be appropriate if one wished to
characterize sounds well enough to distinguish the pronunciations of
different words in any language (homonyms aside), and also the dialect of
the speaker and something of the style of speech. One could not expect to
represent sound well enough to preserve information about sex, age, or,
e.g., the presence of a sinus infection.

For the most part, the features in this system are binary-valued. An
exception is the stress feature, but a commonly held position nowadays is
that stress is more appropriately represented by attributing a structure
to a string of sounds, so perhaps this is not a problem. Chomsky and Halle
also held that other features should take scalar values in detailed
representations, but no grounds were given, and I don't believe it. There
are, however, some arguments in the literature for scalar values.

It's hard to see any present application outside linguistics for a
text-processing system based on language-universal sound representation,
since one doesn't find texts already represented this way, and there are
no devices available to transcribe speech into such representations. The
work being done on automatic transcription is, so far as I know, parochial
and essentially unprincipled.

	* Greg Lee
	* U.S. mail: 562 Moore Hall, Dept. of Linguistics
	* University of Hawaii, Honolulu, HI 96822
	* UUCP: {ihnp4,seismo,ucbvax,dcdwest}!sdcsvax!nosc!uhccux!lee
	* ARPA: uhccux!lee@bass.nosc.MIL
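[Editor's sketch: a set of binary phonological features packed into one
32-bit word, in the spirit of the Chomsky & Halle system described above.
The feature inventory below is a small illustrative subset, not the actual
SPE feature list, and the sample value for /m/ is approximate.]

```python
# Bit i of the word records the presence of feature FEATURES[i].
FEATURES = ["consonantal", "sonorant", "voiced", "nasal", "continuant",
            "high", "low", "back", "round", "tense"]

def encode(present):
    """Pack a set of feature names into one integer (fits easily in 32 bits)."""
    word = 0
    for f in present:
        word |= 1 << FEATURES.index(f)
    return word

def has(word, feature):
    """Test one feature bit."""
    return bool(word & (1 << FEATURES.index(feature)))

# /m/ is (roughly) consonantal, sonorant, voiced, nasal:
m_sound = encode({"consonantal", "sonorant", "voiced", "nasal"})
```

With roughly 20-30 mostly binary features, a full inventory would still
fit one sound per 32-bit word, which is the point being made above.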
kent@xanth.UUCP (Kent Paul Dolan) (09/03/87)
Perhaps, to ease both the typing and the coding burden of accented and
other non-linear alphabets, we could add or identify a terminal key and
matching byte pair for overstrike, like the old card punch "mul pch" key.
Back when I used those beasts, it didn't seem like much of a burden to
hold it down while I punched the several characters to make the right
pattern of holes in one card column.

The goal for the typist, after all, is a bit of flexibility. One may want
to type "a" then "raised circle", another the opposite order. If an
overstrike key were implemented, and it were specifically understood that
the characters typed while it was held down / the bytes between the
overstrike markers were order-independent, this would take care of lots of
languages which decorate letters, of the infamous APL keyboard, and
perhaps of some other problems.

Comments? (I get down this far in my newsgroups once in a blue moon, so
email if you want an answer particularly from me; post for general
delight.)

Kent, the man from xanth.
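[Editor's sketch: the order-independence rule above means a receiver can
canonicalize any overstrike group by sorting the bytes between the
markers. The marker byte 0x10 is an arbitrary choice for illustration, not
an assigned code.]

```python
OVERSTRIKE = "\x10"  # hypothetical overstrike marker byte

def canonicalize(text):
    """Sort the characters inside each overstrike group; leave the rest alone.

    Assumes markers come in matched pairs (raises ValueError otherwise).
    """
    out, i = [], 0
    while i < len(text):
        if text[i] == OVERSTRIKE:
            j = text.index(OVERSTRIKE, i + 1)       # find the closing marker
            group = "".join(sorted(text[i + 1:j]))  # order-independent
            out.append(OVERSTRIKE + group + OVERSTRIKE)
            i = j + 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Typing "a" then the ring, or the ring then "a", then yields the same byte
sequence after canonicalization, so comparison and searching still work.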
alan@pdn.UUCP (Alan Lovejoy) (09/03/87)
In article <1301@houdi.UUCP> marty1@houdi.UUCP (M.BRILLIANT) writes:
>>[explanation that my proposal assigns some arbitrary 32-bit
>>index/look-up key to each existing letter, symbol or ideogram]
>Then your scheme might have to change if another language is to be
>added, or if some language creates another ideograph. There are
>thousands of languages in the world, with I don't know how many
>different alphabets. I wanted to avoid limiting the scope to existing
>alphabets and prove that all possible alphabets could be included.

Actually, I now think that each letter/symbol/ideogram should be assigned
a unique identifier/look-up key, possibly of varying length. New ones
would have to be assigned standard id's as they are invented. Systems
would either support a small, fixed set of signs, or else maintain a
database of signs that can be dynamically updated. The shape and styling
of characters tend to change over time, and software normally should deal
with the logical meaning of a sign, not its shape ("A" is "A", no matter
what font it's in).

>I'm not certain it's literally true that the IPA can describe any human
>speech sound, not just the sounds in a particular set of known
>languages. For instance, could it describe the tone system of
>Chinese? If that's true, you have a feasible proposal.

The IPA defines a consonant matrix of three dimensions, where one
dimension is the manner of articulation (plosive, fricative, glide...),
another is the location (dental, alveolar, glottal...), and the last is
voicing (voiced, unvoiced). For vowels, there is vertical location (high,
mid, low), horizontal location (front, mid, back), tenseness (lax, tense)
and shape (wide, narrow). These categories enable one to produce a sound
fairly close to the actual sound, and the range of sounds that still
satisfy a given description (front, middle, tense, narrow vowel) is
usually about the same as the range of sounds produced by the speakers of
a language. If not, there are special modifying symbols that say "a little
more to the back" or "a little higher". There are also modifiers for
nasality, tone, aspiration, palatalization, and all the other known
variations.

Does that seem satisfactory?

--alan@pdn
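[Editor's sketch: the three-dimensional consonant description above as a
validated tuple. The category lists are abbreviated from the IPA chart for
illustration and are not exhaustive.]

```python
# One axis per dimension of the consonant matrix described above.
MANNERS = {"plosive", "fricative", "nasal", "glide", "trill"}
PLACES = {"bilabial", "dental", "alveolar", "velar", "glottal"}
VOICING = {"voiced", "unvoiced"}

def consonant(manner, place, voicing):
    """Validate and build one consonant description as a (manner, place, voicing) tuple."""
    if manner not in MANNERS or place not in PLACES or voicing not in VOICING:
        raise ValueError("not a recognized category")
    return (manner, place, voicing)

# /d/: a voiced alveolar plosive.
d = consonant("plosive", "alveolar", "voiced")
```

Modifiers (nasality, tone, aspiration, and so on) would then be a fourth,
optional component rather than a new axis, which keeps the base matrix
small.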