[comp.std.internat] generalised alphabets

pom@under..ARPA (Peter O. Mikes) (08/28/87)

Subject: generalised alphabets (letters + diacritics)

	One conclusion which follows from the last month of discussions 
on accented languages is that we should be happy to have ASCII and ISCII,
and that at least one of them has a lasting place in both past and future.

	The other, in accordance with MEGATRENDS, is that we do have 
 and will have a need for a more general method of describing accented
 letters (since the French are not going to give up any cheese).

	So, taking some guidance from mathematics [which can eat three
 alphabets for breakfast and still have Alef for lunch], how would you
 people feel about the following method FOR CODING of complex alphabets?

  1) ASCII or ISCII is and will remain the plain-vanilla work-horse.

  2) As time passes, more and more groups (national or application-specific)
	will ask for extensions. 

BTW: I recall that DP - in the Olden Times - used capitals only
 (so did the first typewriters); now, as the pendulum swings back,
we have the unix excess of the lower-case-good / capital-case-bad era.

 3) To accommodate such (eventual future) demand, in a manner which will
    not degenerate into an orgy of escapes within escapes, consider the following:

 Let's say that we define three sets of signs:
 super-scripts, sub-scripts and (normal level) base-scripts.  We then
 can create e.g. a French set by combining the Latin base with the ` ^ o ...
 set of super-scripts, German by combining the Latin base with .. (umlaut),
 Swedish with a tiny o super-script and base-level overstrike /, etc.

 We can have some mathematical symbols as letters with - + *, etc. (either
 above or next to letters) - and just about every other national alphabet.
 
 It is clear that any national group will need and will use devices
 (e.g. printers) which will produce whatever ornamentations their
 forefathers dreamed up.  By creating a standard - which will define any
 such set in an agreed-upon manner - we can still maintain some order
 in the madness:
a) we do not force anybody to give up their favorite excess, and
b) we do not force (by default) each national task force to dream up
 their own unique (and naturally incompatible) methods of achieving their goal.
                                                           Any comments?

                                pom@under.s1.gov ||  @s1-under.UUCP 

cik@l.cc.purdue.edu (Herman Rubin) (08/29/87)

In article <15488@mordor.s1.gov>, pom@under..ARPA (Peter O. Mikes) writes:
> Subject: generalised alphabets (letters + diacritics)
> 	One conclusion which follows from the last month of discussions 
> on accented languages is, that we should be happy that we have ASCII and ISCII
> and that at least one of them has a lasting place in both past and future.
Why?
> 	The other, being in accordance with MEGATRENDS, is that we do have 
>  and will have also a need for more general method (since French are not going
>  to give up any cheese) of describing accented letters.

What we need is a very flexible method which does not _require_ (as in the
roffs and TeX) that the user hit many keys for a result, and also allows 
whatever the user wants displayed when the user types it.  I do not care if
_you_ define a typed character to be e with a grave accent, or if _you_ require
that for _your_ terminal you must type the accent first and then the e;  but
whatever method that _I_ use should produce the character on my screen, and I
should be able to customize it to my requirements.  At the present time, this
is very difficult.

	For any particular language, it may be advisable to have a standard,
which should be published, so that different people may produce the appropriate
front- and back-ends for communications purposes.  However, this should not be
a standard for terminals.  That ASCII is used for communication is no good 
reason why a terminal should use it.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette, IN 47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet

alan@pdn.UUCP (Alan Lovejoy) (08/31/87)

A Proposal:

What is needed is to make a distinction between logical and physical
characters, and to distinguish between speech sounds and particular
audiographs (which can be further de-generalized into specific fonts).

If every letter for any human alphabet, and every ideograph, were given
a unique 32-bit (for example) id number, it would be possible to create
a 'character look-up table' whose indices were 8-bit numbers (for
example).  ASCII would then become a particular "character palette"
or mapping from 8-bit "logical" characters to 32-bit "physical"
characters.  If this sounds like color
graphics/pixels/color-lookup-tables, that's because the idea was
motivated thereby.

Also, the possible human speech sounds should be given unique id
numbers (is 32 bits sufficient?), and a speech-sound to physical
character translation table would be used to represent each speech
sound with any desired character or sequence of characters.  The
character translation table and the speech sound translation table
would become standard features of all operating systems, and could
be defined at the beginning of each text file.  The character
translation table would either translate an 8-bit index directly 
into a physical character, or else into a speech-sound which could
then be translated into (a) physical character(s).  This option would
be independently decided for each 8-bit character index.

Text files that begin normally would be considered to be ASCII files,
and the 8-bit codes would use the ASCII palette.  With the proper
escape sequences, palettes could be changed at any time, either by 
selecting a predefined palette or defining a new one.

Can anyone suggest improvements on this?  What do you think?

--Alan@pdn

marty1@houdi.UUCP (M.BRILLIANT) (09/01/87)

In article <1209@pdn.UUCP>, alan@pdn.UUCP (Alan Lovejoy) writes:
> A Proposal:
> 
> If every letter for any human alphabet, and every ideograph, were given
> a unique 32-bit (for example) id number, it would be possible to create
> a 'character look-up table' ...
> 
> Also, the possible human speech sounds should be given unique id
> numbers (is 32 bits sufficient?), ....

A 32-bit number could code any possible character that can be described
in a square 16 pixels on a side.  I think some Chinese ideographs are
more complex than that.  With run-length coding, the power of a 32-bit
number is greater, and it might be adequate.

I'm even less certain about the representation of speech sounds.  The
possible positions of the organs of articulation are continuously
variable.  Human speech can be as fast as about 10 distinct sounds per
second.  Coding each sound in 32 bits implies that a vocoder could
encode human speech at 320 bits per second.  Is that possible?

M. B. Brilliant					Marty
AT&T-BL HO 3D-520	(201)-949-1858
Holmdel, NJ 07733	ihnp4!houdi!marty1

alan@pdn.UUCP (09/01/87)

In article <1296@houdi.UUCP> marty1@houdi.UUCP (M.BRILLIANT) writes:
>A 32-bit number could code any possible character that can be described
>in a square 16 pixels on a side.  I think some Chinese ideographs are
>more complex than that.  With run-length coding, the power of a 32-bit
>number is greater, and it might be adequate.

An error and a misunderstanding:

1) The number of possible pictures in an m by n matrix of pixels, where
each pixel can be one of c colors is c^(m*n).  For a 16 x 16 matrix 
of black or white pixels, there are 2^256 possibilities and a 256-bit
number is required.
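The arithmetic can be checked directly, along with how far a 32-bit number actually reaches:

```python
# Distinct m-by-n two-color bitmaps number 2**(m*n), so a 16x16 glyph
# needs 256 bits, far beyond a 32-bit id:
assert 2 ** (16 * 16) == 2 ** 256

# A 32-bit number covers every possible bitmap only up to about
# 5x6 pixels (30 bits); a 6x6 bitmap already needs 36 bits:
assert 2 ** (5 * 6) <= 2 ** 32 < 2 ** (6 * 6)
```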

2) This is NOT what I had in mind at all.  I wanted a simple index to
be assigned to each EXISTING character or ideogram, which index would
be an abstraction independent of any particular graphical representation
or font.

>
>I'm even less certain about the representation of speech sounds.  The
>possible positions of the organs of articulation are continuously
>variable.  Human speech can be as fast as about 10 distinct sounds per
>second.  Coding each sound in 32 bits implies that a vocoder could
>encode human speech at 320 bits per second.  Is that possible?
>

There is something called the International Phonetic Alphabet, which
uses various modifiers to enable the transcription of any human 
speech sound with fewer than 256 symbols.  This would probably be
adequate, although the finer the resolution (the more characters
representing different phones), the less complicated the system
of modifiers needs to be.

pdn

pom@under.UUCP (09/01/87)

Subject: Re: generalised alphabets
Newsgroups: sci.lang,comp.std.internat
References: <15488@mordor.s1.gov> <1209@pdn.UUCP>

In article <1296@houdi.UUCP> you write:
>In article <1209@pdn.UUCP>, alan@pdn.UUCP (Alan Lovejoy) writes:
>> A Proposal:
>> 
>> If every letter for any human alphabet, and every ideograph, were given
>> a unique 32-bit (for example) id number, it would be possible to create
>> a 'character look-up table' ...
>> 
>> Also, the possible human speech sounds should be given unique id
>> numbers (is 32 bits sufficient?), ....
>
>I'm even less certain about the representation of speech sounds.  The
>possible positions of the organs of articulation are continuously

	Yes, the char.map in analogy to the color.map is a good proposal;
 it complements my earlier proposal for creating a standard mechanism
 for coding of the national alphabets, rather than forcing the same set
 of characters on everybody.  Combining several proposals and ideas
 (and borrowing also Device Independence from computer graphics), we 
 get the following:
	Certain applications require large data sets (files) to be
	interchanged between a) different machines and b) locations.
	In many cases these files can be represented by a sequence
	drawn from a small number (let's say N) of symbols (generalised
	characters).  When we ignore bigram frequencies, we need
	log2(N) * L bits per file of length L (L symbols); when the
	frequency of bigrams is exploited, the number of bits decreases
	to M% of that.  Let's call log2(N) K1, and M*K1/100 K2; K is
	either K1 or K2.
  A little bit of info for trivia lovers: Mr. Markov formulated his
  concept of Markov chains while studying the statistics of the Russian 
  language.  Then Shannon gave M for English.  What is it?
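The K1/K2 accounting works out as follows. The values of N, L and M in this sketch are purely illustrative; the article does not fix them.

```python
import math

# With N symbols and no bigram statistics, a file of L symbols costs
# K1 = log2(N) bits per symbol; exploiting bigram frequencies reduces
# this to K2 = M% of K1.  N, L and M below are invented examples.

N, L = 64, 10_000
K1 = math.log2(N)        # 6.0 bits per symbol
M = 50                   # suppose bigram coding saves half
K2 = M * K1 / 100        # 3.0 bits per symbol

assert K1 == 6.0 and K2 == 3.0
print("plain file:", K1 * L, "bits; bigram-coded:", K2 * L, "bits")
```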
      
      We need an agreed-on mechanism by which we tell R (the receiving
      computer): in the following stream of bits, take sets of K 
      bits and interpret them using character look-up table T342 (for example).

      "Interpret" can, for example, mean the following:
      a) If the receiver is an "ASCII only" printer which
      allows overstrikes, then represent symbol i (one of N) by
      a set of strikes (e.g. A^ will be an accented A);
			 b) if the receiver is a printer with an 'appropriate'
     printwheel (i.e. a daisy with N spokes), represent
     symbol i by a single strike of spoke j(i);
			c) if the receiver is a bit-mapped CRT,
    use the graphic image stored in /user/public/T342, or defined
    by the following (cgi) graphic primitives ..., which can be
    scaled, skewed into cursive, underlined, capitalised, ... etc.
    (It is patently wasteful to give a bit to any of these, since
    if you capitalise, it is OFTEN one char or a long sequence.  The
    M and K2 introduced above make this aspect quantitative and general.)

	............ etc.  (The char.map can include a collating sequence
	and (as a special Receiver) vocalisation; that's 
	more complex than just a one-to-one phoneme(char) mapping, but it
	can be coded within the SAME framework as 'one of N phonemes'.)
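The receiving end of the K-bit mechanism can be sketched as follows. The table contents (a hypothetical N=16 set of "generalised digits") and the name T342 as a Python dictionary are illustrative only.

```python
# Sketch of the receiver: take the bit stream in groups of K bits and
# interpret each group through an agreed-upon look-up table (the "T342"
# of the article).  This 16-symbol "generalised digits" table is made up.

K = 4                                    # N = 16 symbols, so K = log2(16)
T342 = {i: ch for i, ch in enumerate("0123456789+-:. E")}

def receive(bitstring, k, table):
    """Split a bit string into k-bit groups and map each through the table."""
    groups = [bitstring[i:i + k] for i in range(0, len(bitstring), k)]
    return "".join(table[int(g, 2)] for g in groups)

# "3:14" encoded as four 4-bit groups (3, 12=':', 1, 4):
stream = "0011" + "1100" + "0001" + "0100"
assert receive(stream, K, T342) == "3:14"
```

The same `receive` function serves any table: an alphabet, digits, or phonemes, as long as sender and receiver agree on which table is in force.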

	This covers any and all national alphabets, C sources AND,
	IMPORTANTly, numerical data sets.  I got an objection
	when I proposed as one special set of N=16 the (generalised)
	digits (i.e. 0,1...9, + - : (as triplet separator), EOF, etc.).
	The objection said: but we can do it in ASCII and we do not
	want to complicate this.
	My objection to the objection is as follows: often we do it
	in ASCII, but not always.  I worked on a "large-scale numerical
	simulation project" which had to ship Megabytes from a Cray to
	an Iris workstation (for display).  ASCII was too slow, so extra
	coding was needed to interpret binary files.  These applications
	will not go away - there will be more and more of them - and we
	do not want to go to raw binaries; that's what #SCII should do,
	WITHOUT forcing me to ship a bit for case (capital, lower case)
	with each symbol - that's absurd (in many applications).


                                pom@under.s1.gov ||  @s1-under.UUCP 

marty1@houdi.UUCP (M.BRILLIANT) (09/02/87)

In article <1222@pdn.UUCP>, alan@pdn.UUCP (Alan Lovejoy) writes:
> In article <1296@houdi.UUCP> marty1@houdi.UUCP (M.BRILLIANT) writes:
> >A 32-bit number could code any possible character that can be described
> >in a square 16 pixels on a side...
> 
> An error ....
> 
> (1) .....  For a 16 x 16 matrix 
> of black or white pixels, there are 2^256 possibilities and a 256-bit
> number is required.

Sorry, that was careless of me.  A 32-bit number will encode all 5x6
characters (and only a sixteenth of the 6x6 ones), more with run-length
coding.  Not very impressive.

> ... and a misunderstanding:

> 2) This is NOT what I had in mind at all.  I wanted a simple index to
> be assigned to each EXISTING character or ideogram, which index would
> be an abstraction independent of any particular graphical representation
> or font.

Then your scheme might have to change if another language is to be
added, or if some language creates another ideograph.  There are
thousands of languages in the world, with I don't know how many
different alphabets.  I wanted to avoid limiting the scope to existing
alphabets and prove that all possible alphabets could be included.

Actually, I gather that Chinese ideographs follow certain conventions
inscrutable to the alphabetic mind, and that there's a typesetting
system for them.  That would make it easier to represent them in a
universal alphabet, even if they need more than one 32-bit number
each.  I hope other languages are as easy.

> >I'm even less certain about the representation of speech sounds...
> >.....  Coding each sound in 32 bits implies that a vocoder could
> >encode human speech at 320 bits per second.  Is that possible?
> 
> There is something called the International Phonetic Alphabet, which
> uses various modifiers to enable the transcription of any human 
> speech sound with less than 256 symbols....

I'm not certain it's literally true that the IPA can describe any human
speech sound, not just the sounds in a particular set of known
languages.  For instance, could it describe the tone system of
Chinese?  If that's true, you have a feasible proposal.

M. B. Brilliant					Marty
AT&T-BL HO 3D-520	(201)-949-1858
Holmdel, NJ 07733	ihnp4!houdi!marty1

lee@uhccux.UUCP (Greg Lee) (09/02/87)

32 bits sounds about right to me for characterizing all the sounds of human
languages.  This is somewhat greater than the number of features proposed in
The Sound Pattern of English, Chomsky and Halle, 1968. The feature system
proposed there is a standard in linguistics.  Since 1968 there have been many
modifications proposed to this system, but none that would change the total of
features greatly.  The features have to do with positions and movements of the
organs of articulation.
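The "about 32 binary features" observation suggests a compact representation: one bit per feature in a single 32-bit word. The feature list and bit positions below are illustrative, not the actual Chomsky-Halle inventory.

```python
# Sketch: packing binary phonological features (in the style of The
# Sound Pattern of English) into one 32-bit word.  The feature names
# and bit assignments here are invented for illustration.

FEATURES = ["consonantal", "sonorant", "voiced", "nasal", "continuant",
            "high", "low", "back", "round", "tense"]   # ... up to 32

def pack(present):
    """Set one bit for each feature that is + for this sound."""
    word = 0
    for f in present:
        word |= 1 << FEATURES.index(f)
    return word

def has(word, feature):
    """Test whether a packed sound is + for the given feature."""
    return bool(word >> FEATURES.index(feature) & 1)

m = pack({"consonantal", "sonorant", "voiced", "nasal"})   # roughly /m/
assert has(m, "nasal") and not has(m, "continuant")
assert m < 2 ** 32    # the whole representation fits in 32 bits
```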

Such a system of representation would be appropriate if one wished to
characterize sounds sufficiently well so as to distinguish the pronunciations
of different words for any language, homonyms aside, and also the dialect of
the speaker, and something of the style of speech.  One could not expect to
represent sound well enough to preserve information about sex, age, or, e.g.
presence of sinus infection.

For the most part, the features in this system are binary-valued.  An exception
is the stress feature, but nowadays a commonly held position is that stress is
more appropriately represented by attributing a structure to a string of
sounds.  So perhaps this is not a problem.  It was also held by Chomsky and
Halle that other features should have scalar values in detailed
representations, but no grounds were given, and I don't believe it.  However,
there are some arguments for scalar values in the literature.

It's hard to see any present application outside linguistics for a
text-processing system based on language-universal sound representation, since
one doesn't find texts already represented this way, and there are no devices
available to transcribe speech into such representations.  The work that is
being done on automatic transcription is, so far as I know, parochial and
essentially unprincipled.

*   Greg Lee
*   U.S. mail: 562 Moore Hall, Dept. of Linguistics
*	       University of Hawaii, Honolulu, HI 96822
*     UUCP:	{ihnp4,seismo,ucbvax,dcdwest}!sdcsvax!nosc!uhccux!lee
*     ARPA:	uhccux!lee@bass.nosc.MIL

kent@xanth.UUCP (Kent Paul Dolan) (09/03/87)

Perhaps, to ease both the typing and the coding burden of accented and
other non-linear alphabets, we could add or identify a terminal key
and matching byte pair for overstrike, like the old card punch "mul
pch" key.  Back when I used these beasts, it didn't seem like much of
a burden to hold it down while I punched the several characters to
make the right pattern of holes in one card column.

The goal for the typist, after all, is a bit of flexibility.  One may
want to type "a" then "raised circle", another the opposite order.  If
an overstrike key were implemented, and it were specifically understood
that the characters typed while it was held / the bytes between the
overstrike markers were order-independent, this would take care of lots
of languages which decorate letters, the infamous APL keyboard, and
perhaps some other problems.

Comments?  (I get down this far in my newsgroups once in a blue moon,
so email if you want an answer particularly from me; post for general
delight.)

Kent, the man from xanth.

alan@pdn.UUCP (Alan Lovejoy) (09/03/87)

In article <1301@houdi.UUCP> marty1@houdi.UUCP (M.BRILLIANT) writes:
>>[explanation that my proposal assigns some arbitrary 32-bit
>>index/look-up key to each existing letter, symbol or ideogram]

>Then your scheme might have to change if another language is to be
>added, or if some language creates another ideograph.  There are
>thousands of languages in the world, with I don't know how many
>different alphabets.  I wanted to avoid limiting the scope to existing
>alphabets and prove that all possible alphabets could be included.

Actually, I now think that each letter/symbol/ideogram should be
assigned a unique identifier/look-up key - possibly of varying length.
New ones would have to be assigned standard id's as they are invented.  
Systems would either support a small, fixed set of signs, or else maintain
a database of signs that can be dynamically updated.

The shape and styling of characters tend to change over time, and
software normally should deal with the logical meaning of a sign, not 
its shape ("A" is "A", no matter what font it's in).

>I'm not certain it's literally true that the IPA can describe any human
>speech sound, not just the sounds in a particular set of known
>languages.  For instance, could it describe the tone system of
>Chinese?  If that's true, you have a feasible proposal.

The IPA defines a consonant matrix of three dimensions, where one 
dimension is the type of articulation (plosive, fricative, glide...), 
another is the location (dental, alveolar, glottal...), and the last
is for voicing (voiced, unvoiced).  

For vowels, there is vertical location (high, mid, low), horizontal location
(front, mid, back), tenseness (lax, tense) and shape (wide, narrow).
These categories enable one to produce a sound fairly close to the
actual sound, and the range of possible sounds that still satisfy
the description (front, middle, tense, narrow vowel) is usually
about the same as the range of sounds produced by the speakers of
a language.  If not, there are special modifying symbols that say
"a little more to the back", or "a little higher".  There are also
modifiers for nasality, tone, aspiration, palatalization, and all the
other known variations.
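The three-dimensional consonant description can be sketched as a tuple plus a modifier set. The category lists below are abbreviated samples, not the full IPA inventory:

```python
# Sketch of the IPA consonant space: (manner, place, voicing), with
# modifiers layered on top.  The category lists are abbreviated.

MANNERS = {"plosive", "fricative", "nasal", "glide"}
PLACES = {"bilabial", "dental", "alveolar", "velar", "glottal"}
VOICING = {"voiced", "unvoiced"}

def consonant(manner, place, voicing, modifiers=()):
    """Describe a consonant by its three IPA dimensions plus modifiers."""
    assert manner in MANNERS and place in PLACES and voicing in VOICING
    return (manner, place, voicing, frozenset(modifiers))

b = consonant("plosive", "bilabial", "voiced")
b_aspirated = consonant("plosive", "bilabial", "voiced", {"aspirated"})
assert b != b_aspirated   # modifiers distinguish otherwise-equal sounds
```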

Does that seem satisfactory?

--alan@pdn

andersa@kuling.UUCP (Anders Andersson) (09/15/87)

In article <2342@xanth.UUCP> kent@xanth.UUCP (Kent Paul Dolan) writes:
>The goal for the typist, after all, is a bit of flexibility.  One may
>want to type "a" then "raised circle", another the opposite order.  If
>an overstrike key were implemented, and it were specifically understood
>that the characters typed while it was held / the bytes between the
>overstrike markers were order-independent, this would take care of lots
>of languages which decorate letters, the infamous APL keyboard, and
>perhaps some other problems.

I think keyboard design is a problem which should be kept separate from
text representation.  While there is a need for a common standard to
identify the characters as abstract items and display them in an
unambiguous way on any screen or sheet of paper, few people will actually
need the ability to type each character or ideograph themselves.

Keyboards are very much subject to regional and individual taste, and I
think they will continue to be so. Semi-advanced keyboards will of course
contain some extra modifier keys for producing various "foreign" characters,
although they will probably have separate keys for what's common locally.

I wouldn't accept pressing two or three keys in a row just to produce some
of the vowels which are particular to Swedish, but I wouldn't mind using
that method for typing an occasional accented "e" in some French name for
instance. Turkish keyboards won't care about these, but will probably have
dotted and undotted "i" separated instead, and so on.

Ideally, there should be a common keyboard interface standard, giving me
the ability to bring my own (perhaps customized) keyboard when going
abroad, and have the right things appear on the screen when I plug it into
a Japanese workstation and start typing... That's flexibility!

Keyboard layout standardisation is of course an important issue, but
where to put which modifier keys and how to use them seems to belong
more to the problem of QWERTY vs. AZERTY than to international, digital
text representation.
-- 
Anders Andersson, Dept. of Computer Systems, Uppsala University, Sweden
Phone: +46 18 183170
UUCP: andersa@kuling.UUCP (...!{seismo,mcvax}!enea!kuling!andersa)