pom@under..ARPA (Peter O. Mikes) (08/28/87)
Subject: generalised alphabets (letters + diacritics)

One conclusion which follows from the last month of discussions on accented languages is that we should be happy that we have ASCII and ISCII, and that at least one of them has a lasting place in both past and future. The other, being in accordance with MEGATRENDS, is that we do have and will have a need for a more general method (since the French are not going to give up any cheese) of describing accented letters. So, taking some guidance from mathematics [which can eat three alphabets for breakfast and still have aleph for lunch], how would you people feel about the following method FOR CODING of complex alphabets?

1) ASCII or ISCII is and will remain a plain vanilla work-horse.

2) As time passes, more and more groups (national or application) will ask for extensions. BTW: I recall that DP - in the Olden Times - used capitals only (so did the first typewriters); now, as the pendulum swings back, we have the unix excess of the lower-case-good / capital-case-bad era.

3) To accommodate such (eventual future) demand, in a manner which will not degenerate into an orgy of escapes within escapes, consider the following. Let's say that we define three sets of signs: super-scripts, sub-scripts and (normal level) base-scripts. We can then create, e.g., a French set by combining a Latin base with the ` ^ o ... set of super-scripts, a German set by combining a Latin base with .. (umlaut), a Swedish set with a tiny o superscript and a base-level overstrike /, etc. We can have some mathematical symbols as letters with - + * etc. (either above or next to letters) - and just about every other national alphabet. It is clear that any national group will need and will use devices (e.g. printers) which will produce whatever ornamentations their forefathers dreamed up.

By creating a standard - which will define any such set in an agreed-upon manner - we can still maintain some order in the madness: a) we do not force anybody to give up their favorite excess, and b) we do not force (by default) each national task force to dream up their own unique and (naturally incompatible) methods of achieving their goal.

Any comments?

pom@under.s1.gov || @s1-under.UUCP
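The three-set scheme above can be sketched in a few lines. This is only an illustration of the idea, not any standard; the names (BASE, FRENCH_MARKS, compose) and the mark characters are invented for the example.

```python
# Sketch of the proposal: a national alphabet is defined by pairing a
# base-script repertoire with a per-language set of super-scripts.
# All names and repertoires here are hypothetical illustrations.

BASE = set("abcdefghijklmnopqrstuvwxyz")

FRENCH_MARKS = {"`", "^", "'"}   # grave, circumflex, acute
GERMAN_MARKS = {'"'}             # umlaut (two dots)
SWEDISH_MARKS = {"o"}            # small ring above

def compose(base, mark=None):
    """Encode a letter as an explicit (base-script, super-script) pair."""
    if base not in BASE:
        raise ValueError("not a base-script letter: %r" % base)
    return (base, mark)

# An accented French 'e' and a plain 'e' share the same base letter;
# only the super-script component differs:
assert compose("e", "`")[0] == compose("e")[0]
```

The point of the pair representation is exactly the one argued for in the post: each national group picks its own mark set, but the *mechanism* of combination stays common.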
cik@l.cc.purdue.edu (Herman Rubin) (08/29/87)
In article <15488@mordor.s1.gov>, pom@under..ARPA (Peter O. Mikes) writes:
> Subject: generalised alphabets (letters + diacritics)
> One conclusion which follows from the last month of discussions
> on accented languages is that we should be happy that we have ASCII and ISCII,
> and that at least one of them has a lasting place in both past and future.

Why?

> The other, being in accordance with MEGATRENDS, is that we do have
> and will have a need for a more general method (since the French are not
> going to give up any cheese) of describing accented letters.

What we need is a very flexible method which does not _require_ (as in the roffs and TeX) that the user hit many keys for a result, and which also allows whatever the user wants displayed when the user types it. I do not care if _you_ define a typed character to be e with a grave accent, or if _you_ require that for _your_ terminal you must type the accent first and then the e; but whatever method _I_ use should produce the character on my screen, and I should be able to customize it to my requirements. At the present time, this is very difficult.

For any particular language, it may be advisable to have a standard, which should be published, so that different people may produce the appropriate front- and back-ends for communications purposes. However, this should not be a standard for terminals. That ASCII is used for communication is no good reason why a terminal should use it.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317) 494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet
alan@pdn.UUCP (Alan Lovejoy) (08/31/87)
A Proposal: What is needed is to make a distinction between logical and physical characters, and to distinguish between speech sounds and particular audiographs (which can be further de-generalized into specific fonts).

If every letter of any human alphabet, and every ideograph, were given a unique 32-bit (for example) id number, it would be possible to create a 'character look-up table' whose indices were 8-bit numbers (for example). ASCII would then become a particular "character palette", or mapping from 8-bit "logical" characters to 32-bit "physical" characters. If this sounds like color graphics/pixels/color-lookup-tables, that's because the idea was motivated thereby.

Also, the possible human speech sounds should be given unique id numbers (is 32 bits sufficient?), and a speech-sound to physical-character translation table would be used to represent each speech sound with any desired character or sequence of characters.

The character translation table and the speech sound translation table would become standard features of all operating systems, and could be defined at the beginning of each text file. The character translation table would either translate an 8-bit index directly into a physical character, or else into a speech sound which could then be translated into (a) physical character(s). This option would be independently decided for each 8-bit character index.

Text files that begin normally would be considered to be ASCII files, and the 8-bit codes would use the ASCII palette. With the proper escape sequences, palettes could be changed at any time, either by selecting a predefined palette or by defining a new one.

Can anyone suggest improvements on this? What do you think?

--Alan@pdn
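The palette analogy above can be made concrete with a small sketch. All the id values and names here are invented for illustration; the proposal assigns no actual numbers.

```python
# Sketch of the "character palette" proposal: an 8-bit logical code
# indexes a table of 32-bit universal character ids, just as a pixel
# value indexes a color look-up table. All ids here are made up.

# A default palette: map each 7-bit ASCII code to itself as a 32-bit id.
ASCII_PALETTE = {i: i for i in range(128)}

# A hypothetical French palette reuses a spare 8-bit slot for
# 'e with acute', whose invented universal id is 0x000000E9.
FRENCH_PALETTE = dict(ASCII_PALETTE)
FRENCH_PALETTE[0x80] = 0x000000E9

def physical(logical_byte, palette):
    """Translate an 8-bit logical character into its 32-bit physical id."""
    return palette[logical_byte]

assert physical(0x41, ASCII_PALETTE) == 0x41        # 'A' maps to itself
assert physical(0x80, FRENCH_PALETTE) == 0xE9       # remapped slot
```

An escape sequence in a file would, under this scheme, simply swap which dictionary is in effect; the 8-bit stream itself never changes width.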
marty1@houdi.UUCP (M.BRILLIANT) (09/01/87)
In article <1209@pdn.UUCP>, alan@pdn.UUCP (Alan Lovejoy) writes:
> A Proposal:
> 
> If every letter for any human alphabet, and every ideograph, were given
> a unique 32-bit (for example) id number, it would be possible to create
> a 'character look-up table' ...
> 
> Also, the possible human speech sounds should be given unique id
> numbers (is 32 bits sufficient?), ....

A 32-bit number could code any possible character that can be described in a square 16 pixels on a side. I think some Chinese ideographs are more complex than that. With run-length coding, the power of a 32-bit number is greater, and it might be adequate.

I'm even less certain about the representation of speech sounds. The possible positions of the organs of articulation are continuously variable. Human speech can be as fast as about 10 distinct sounds per second. Coding each sound in 32 bits implies that a vocoder could encode human speech at 320 bits per second. Is that possible?

M. B. Brilliant     Marty
AT&T-BL HO 3D-520   (201) 949-1858
Holmdel, NJ 07733   ihnp4!houdi!marty1
alan@pdn.UUCP (09/01/87)
In article <1296@houdi.UUCP> marty1@houdi.UUCP (M.BRILLIANT) writes:
>A 32-bit number could code any possible character that can be described
>in a square 16 pixels on a side. I think some Chinese ideographs are
>more complex than that. With run-length coding, the power of a 32-bit
>number is greater, and it might be adequate.

An error and a misunderstanding:

1) The number of possible pictures in an m by n matrix of pixels, where each pixel can be one of c colors, is c^(m*n). For a 16 x 16 matrix of black-or-white pixels, there are 2^256 possibilities and a 256-bit number is required.

2) This is NOT what I had in mind at all. I wanted a simple index to be assigned to each EXISTING character or ideogram, which index would be an abstraction independent of any particular graphical representation or font.

>I'm even less certain about the representation of speech sounds. The
>possible positions of the organs of articulation are continuously
>variable. Human speech can be as fast as about 10 distinct sounds per
>second. Coding each sound in 32 bits implies that a vocoder could
>encode human speech at 320 bits per second. Is that possible?

There is something called the International Phonetic Alphabet, which uses various modifiers to enable the transcription of any human speech sound with fewer than 256 symbols. This would probably be adequate, although the finer the resolution (the more characters representing different phones), the less complicated the system of modifiers needs to be.

pdn
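The corrected counting argument above is easy to check mechanically. This small sketch just restates the arithmetic; the function name is mine, not anything from the thread.

```python
import math

# Checking the correction: an m-by-n grid of c-color pixels admits
# c**(m*n) distinct pictures, so enumerating them takes log2(c**(m*n))
# = m*n*log2(c) bits. A 16x16 black/white bitmap therefore needs 256
# bits, not 32.

def bits_needed(m, n, c=2):
    """Bits required to number every possible m x n picture in c colors."""
    return math.ceil(m * n * math.log2(c))

assert bits_needed(16, 16) == 256   # the corrected figure in the post
assert bits_needed(5, 6) == 30      # a 32-bit code tops out around 5x6
```

This also shows why the later remark in the thread about 6x6 bitmaps is borderline: 6x6 binary pictures need 36 bits, slightly more than 32.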
pom@under.UUCP (09/01/87)
Subject: Re: generalised alphabets
Newsgroups: sci.lang,comp.std.internat
References: <15488@mordor.s1.gov> <1209@pdn.UUCP>

In article <1296@houdi.UUCP> you write:
>In article <1209@pdn.UUCP>, alan@pdn.UUCP (Alan Lovejoy) writes:
>> A Proposal:
>>
>> If every letter for any human alphabet, and every ideograph, were given
>> a unique 32-bit (for example) id number, it would be possible to create
>> a 'character look-up table' ...
>>
>> Also, the possible human speech sounds should be given unique id
>> numbers (is 32 bits sufficient?), ....
>
>I'm even less certain about the representation of speech sounds. The
>possible positions of the organs of articulation are continuously

Yes, the char.map, in analogy to the color.map, is a good proposal; it complements my earlier proposal for creating a standard mechanism for coding the national alphabets, rather than forcing the same set of characters on everybody. Combining several proposals and ideas (and borrowing also Device Independence from computer graphics) we get the following:

Certain applications require large data sets (files) to be interchanged between a) different machines and b) locations. In many cases these files can be represented by a sequence drawn from a small number (let's say N) of symbols (generalised characters). When we ignore bigram frequencies, we need log2(N) * L bits per file of length L (L symbols); when the frequency of bigrams is exploited, the number of bits decreases to M% of that. Let's call log2(N) K1, call M*K1/100 K2, and let K be either K1 or K2.

A little bit of info for trivia lovers: Mr. Markov formulated his concept of Markov Chains while studying the statistics of the Russian language. Then Shannon gives M for English. What is it?

We need an agreed-on mechanism by which we tell R (the receiving computer): in the following stream of bits, take sets of K bits and interpret them using character lookup table T342 (for example). Interpret can, for example, mean the following:

a) if the receiver is an "ASCII only" printer which allows overstrikes, then represent symbol i (one of N) by this set of strikes (e.g. A^ will be accented A);

b) if the receiver is a printer with an 'appropriate' printwheel (i.e. a daisy with N spokes), represent symbol i by a single strike of spoke j(i);

c) if the receiver is a bit-mapped CRT, use the graphic image stored in /user/public/T342, or defined by the following (cgi) graphic primitives ... which can be scaled, skewed into cursive, underlined, capitalised, ... etc. (It is patently wasteful to give a bit to any of these, since if you capitalise, it is OFTEN one char or a long sequence. The M and K2 introduced above make this aspect quantitative and general.)

............ etc. (The char.map can include a collating sequence and, as a special Receiver, vocalisation. That's more complex than just a one-to-one phoneme(char), but it can be coded within the SAME framework as 'one of N phonemes'.)

This covers all and any national alphabets, C sources AND, IMPORTANTLY, numerical data sets. I got an objection when I proposed, as one special set with N=16, the (generalised) digits (i.e. 0, 1 ... 9, + - : (as triplet separator), EOF, etc.). The objection said: but we can do it in ASCII and we do not want to complicate this. My objection to the objection is as follows: often we do it in ASCII, but not always. I worked on a "large-scale numerical simulation project" which had to ship megabytes from a Cray to an Iris workstation (for display). ASCII was too slow, so extra coding was needed to interpret binary files.

These applications will not go away, there will be more and more of them, and we do not want to go to binaries - that's what #SCII should do, WITHOUT forcing me to ship a bit for case (capital, lower case) with each symbol - that's absurd (in many applications).

pom@under.s1.gov || @s1-under.UUCP
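The K1/K2 arithmetic in the post can be written down directly. This is only the bookkeeping from the post itself; the value of M for English is deliberately left as the open question the author poses.

```python
import math

# Sketch of the bit-cost bookkeeping: with N equiprobable symbols, a
# file of L symbols costs K1 = log2(N) bits per symbol. If a bigram
# model compresses this to M percent, the per-symbol cost is
# K2 = M * K1 / 100. (Shannon's value of M for English is the
# open question in the post and is not filled in here.)

def k1(n_symbols):
    """Bits per symbol, ignoring bigram frequencies."""
    return math.log2(n_symbols)

def k2(n_symbols, m_percent):
    """Bits per symbol when a bigram model reaches M% of K1."""
    return m_percent * k1(n_symbols) / 100.0

# The proposed "generalised digits" set of N=16 costs 4 bits/symbol,
# half the 8 bits of shipping the same data as ASCII characters:
assert k1(16) == 4.0
# An illustrative (not Shannon's) M of 75% would cut that to 3:
assert k2(16, 75) == 3.0
```

This is exactly the Cray-to-Iris point: a 16-symbol numeric alphabet packs two symbols per byte, with no case bit wasted on data that has no case.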
marty1@houdi.UUCP (M.BRILLIANT) (09/02/87)
In article <1222@pdn.UUCP>, alan@pdn.UUCP (Alan Lovejoy) writes:
> In article <1296@houdi.UUCP> marty1@houdi.UUCP (M.BRILLIANT) writes:
> >A 32-bit number could code any possible character that can be described
> >in a square 16 pixels on a side...
> 
> An error ....
> 
> (1) ..... For a 16 x 16 matrix
> of black or white pixels, there are 2^256 possibilities and a 256-bit
> number is required.

Sorry, that was careless of me. A 32-bit number will encode almost all 6x6 characters, more with run-length coding. Not very impressive.

> ... and a misunderstanding:
> 2) This is NOT what I had in mind at all. I wanted a simple index to
> be assigned to each EXISTING character or ideogram, which index would
> be an abstraction independent of any particular graphical representation
> or font.

Then your scheme might have to change if another language is to be added, or if some language creates another ideograph. There are thousands of languages in the world, with I don't know how many different alphabets. I wanted to avoid limiting the scope to existing alphabets and prove that all possible alphabets could be included.

Actually, I gather that Chinese ideographs follow certain conventions inscrutable to the alphabetic mind, and that there's a typesetting system for them. That would make it easier to represent them in a universal alphabet, even if they need more than one 32-bit number each. I hope other languages are as easy.

> >I'm even less certain about the representation of speech sounds...
> >..... Coding each sound in 32 bits implies that a vocoder could
> >encode human speech at 320 bits per second. Is that possible?
> 
> There is something called the International Phonetic Alphabet, which
> uses various modifiers to enable the transcription of any human
> speech sound with fewer than 256 symbols....

I'm not certain it's literally true that the IPA can describe any human speech sound, not just the sounds in a particular set of known languages. For instance, could it describe the tone system of Chinese? If it can, you have a feasible proposal.

M. B. Brilliant     Marty
AT&T-BL HO 3D-520   (201) 949-1858
Holmdel, NJ 07733   ihnp4!houdi!marty1
lee@uhccux.UUCP (Greg Lee) (09/02/87)
32 bits sounds about right to me for characterizing all the sounds of human languages. This is somewhat greater than the number of features proposed in The Sound Pattern of English (Chomsky and Halle, 1968). The feature system proposed there is a standard in linguistics. Since 1968 there have been many modifications proposed to this system, but none that would change the total number of features greatly. The features have to do with positions and movements of the organs of articulation.

Such a system of representation would be appropriate if one wished to characterize sounds sufficiently well to distinguish the pronunciations of different words in any language, homonyms aside, and also the dialect of the speaker and something of the style of speech. One could not expect to represent sound well enough to preserve information about sex, age, or, e.g., the presence of a sinus infection.

For the most part, the features in this system are binary-valued. An exception is the stress feature, but nowadays a commonly held position is that stress is more appropriately represented by attributing a structure to a string of sounds. So perhaps this is not a problem. It was also held by Chomsky and Halle that other features should have scalar values for detailed representations, but no grounds were given, and I don't believe it. However, there are some arguments for scalar values in the literature.

It's hard to see any present application outside linguistics for a text-processing system based on language-universal sound representation, since one doesn't find texts already represented this way, and there are no devices available to transcribe speech into such representations. The work that is being done on automatic transcription is, so far as I know, parochial and essentially unprincipled.

* Greg Lee
* U.S. mail: 562 Moore Hall, Dept. of Linguistics
* University of Hawaii, Honolulu, HI 96822
* UUCP: {ihnp4,seismo,ucbvax,dcdwest}!sdcsvax!nosc!uhccux!lee
* ARPA: uhccux!lee@bass.nosc.MIL
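The "mostly binary-valued features in about 32 bits" observation suggests a straightforward packing, sketched below. The feature names and bit positions are illustrative inventions, not the SPE inventory.

```python
# Sketch: packing binary distinctive features (in the Chomsky-Halle
# style) into one machine word, one bit per feature. The feature list
# here is a small invented sample, not the real SPE feature set; the
# point is only that ~30 binary features fit comfortably in 32 bits.

FEATURES = ["voiced", "nasal", "continuant", "strident", "high", "back"]

def pack(plus_features):
    """Encode the set of '+'-valued features as a bitmask."""
    word = 0
    for f in plus_features:
        word |= 1 << FEATURES.index(f)
    return word

def has(word, feature):
    """Test whether a packed segment is '+' for the given feature."""
    return bool(word & (1 << FEATURES.index(feature)))

m = pack({"voiced", "nasal"})   # roughly an /m/-like segment
assert has(m, "voiced") and has(m, "nasal")
assert not has(m, "high")
```

Stress, being non-binary in the original proposal, is the one feature that would not fit this one-bit-per-feature scheme without the structural treatment Lee mentions.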
kent@xanth.UUCP (Kent Paul Dolan) (09/03/87)
Perhaps, to ease both the typing and the coding burden of accented and other non-linear alphabets, we could add or identify a terminal key and matching byte pair for overstrike, like the old card punch "mul pch" key. Back when I used these beasts, it didn't seem like much of a burden to hold it down while I punched the several characters to make the right pattern of holes in one card column.

The goal for the typist, after all, is a bit of flexibility. One may want to type "a" then "raised circle", another the opposite order. If an overstrike key were implemented, and it were specifically understood that the characters typed while it was held / the bytes between the overstrike markers were order independent, this would take care of lots of languages which decorate letters, the infamous APL keyboard, and perhaps some other problems.

Comments? (I get down this far in my newsgroups once in a blue moon, so email if you want an answer particularly from me; post for general delight.)

Kent, the man from xanth.
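The order-independence rule is the interesting part of this proposal, and it can be sketched as a normalization pass over the byte stream. The marker byte value and function names below are invented for the example.

```python
# Sketch of the overstrike proposal: bytes between overstrike markers
# form a *set*, so "a" + ring and ring + "a" must compare equal. A
# receiver can enforce this by canonically sorting each marked group.
# The marker byte value (0x0E) is a made-up choice for illustration.

OVERSTRIKE = "\x0e"   # hypothetical marker bracketing a composite

def normalize(text):
    """Sort each overstrike group so typing order no longer matters."""
    parts = text.split(OVERSTRIKE)
    # Odd-indexed pieces lie between marker pairs; sort those only.
    fixed = ["".join(sorted(p)) if i % 2 == 1 else p
             for i, p in enumerate(parts)]
    return OVERSTRIKE.join(fixed)

a_ring_first = "x" + OVERSTRIKE + "ao" + OVERSTRIKE + "y"
ring_a_first = "x" + OVERSTRIKE + "oa" + OVERSTRIKE + "y"
assert normalize(a_ring_first) == normalize(ring_a_first)
```

Text outside the markers is untouched, so the scheme stays a strict superset of plain ASCII streams.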
alan@pdn.UUCP (Alan Lovejoy) (09/03/87)
In article <1301@houdi.UUCP> marty1@houdi.UUCP (M.BRILLIANT) writes:
>>[explanation that my proposal assigns some arbitrary 32-bit
>>index/look-up key to each existing letter, symbol or ideogram]
>Then your scheme might have to change if another language is to be
>added, or if some language creates another ideograph. There are
>thousands of languages in the world, with I don't know how many
>different alphabets. I wanted to avoid limiting the scope to existing
>alphabets and prove that all possible alphabets could be included.

Actually, I now think that each letter/symbol/ideogram should be assigned a unique identifier/look-up key, of possibly varying length. New ones would have to be assigned standard ids as they are invented. Systems would either support a small, fixed set of signs, or else maintain a database of signs that can be dynamically updated. The shape and styling of characters tend to change over time, and software normally should deal with the logical meaning of a sign, not its shape ("A" is "A", no matter what font it's in).

>I'm not certain it's literally true that the IPA can describe any human
>speech sound, not just the sounds in a particular set of known
>languages. For instance, could it describe the tone system of
>Chinese? If that's true, you have a feasible proposal.

The IPA defines a consonant matrix of three dimensions, where one dimension is the type of articulation (plosive, fricative, glide...), another is the location (dental, alveolar, glottal...), and the last is voicing (voiced, unvoiced). For vowels, there is vertical location (high, mid, low), horizontal location (front, mid, back), tenseness (lax, tense) and shape (wide, narrow). These categories enable one to produce a sound fairly close to the actual sound, and the range of sounds that still satisfy a given description (front, middle, tense, narrow vowel) is usually about the same as the range of sounds produced by the speakers of a language. If not, there are special modifying symbols that say "a little more to the back" or "a little higher". There are also modifiers for nasality, tone, aspiration, palatalization, and all the other known variations.

Does that seem satisfactory?

--alan@pdn
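The three-dimensional consonant matrix described above lends itself to a tuple representation. The inventories below are abbreviated to the examples given in the post, not the full IPA chart.

```python
# Sketch of the consonant space as described in the post: a point in a
# manner x place x voicing grid. Inventories are the post's abbreviated
# examples, not the complete IPA categories.

MANNERS = {"plosive", "fricative", "glide"}
PLACES = {"dental", "alveolar", "glottal"}
VOICING = {"voiced", "unvoiced"}

def consonant(manner, place, voicing):
    """Validate and build a consonant descriptor triple."""
    if manner not in MANNERS or place not in PLACES or voicing not in VOICING:
        raise ValueError("not a point in the consonant matrix")
    return (manner, place, voicing)

# /z/ and /s/ occupy the same cell except along the voicing dimension:
z = consonant("fricative", "alveolar", "voiced")
s = consonant("fricative", "alveolar", "unvoiced")
assert z[:2] == s[:2] and z[2] != s[2]
```

Modifiers ("a little more to the back", nasality, tone) would then be extra fields layered on the base triple rather than new dimensions, which is how they behave in the description above.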
andersa@kuling.UUCP (Anders Andersson) (09/15/87)
In article <2342@xanth.UUCP> kent@xanth.UUCP (Kent Paul Dolan) writes:
>The goal for the typist, after all, is a bit of flexibility. One may
>want to type "a" then "raised circle", another the opposite order. If
>an overstrike key were implemented, and it were specifically understood
>that the characters typed while it was held / bytes between the
>overstrike markers were order independent, this would take care of lots
>of languages which decorate letters, the infamous APL keyboard, and
>perhaps some other problems.

I think keyboard design is a problem which should be kept separate from text representation. While there is a need for a common standard for identifying characters as abstract items and displaying them in an unambiguous way on any screen or sheet of paper, few people will actually need the ability to type every character or ideograph themselves. Keyboards are very much subject to regional and individual taste, and I think they will continue to be so.

Semi-advanced keyboards will of course contain some extra modifier keys for producing various "foreign" characters, although they will probably have separate keys for what's common locally. I wouldn't accept pressing two or three keys in a row just to produce some of the vowels which are particular to Swedish, but I wouldn't mind using that method for typing an occasional accented "e" in some French name, for instance. Turkish keyboards won't care about these, but will probably have dotted and undotted "i" separated instead, and so on.

Ideally, there should be a common keyboard interface standard, giving me the ability to bring my own (perhaps customized) keyboard when going abroad, and have the right things appear on the screen when I plug it into a Japanese workstation and start typing... That's flexibility!

Keyboard layout standardisation is of course an important issue, but where to put which modifier keys and how to use them seems to belong more to the problem of QWERTY vs. AZERTY than to international, digital text representation.
-- 
Anders Andersson, Dept. of Computer Systems, Uppsala University, Sweden
Phone: +46 18 183170
UUCP: andersa@kuling.UUCP (...!{seismo,mcvax}!enea!kuling!andersa)