[net.internat] What do we REALLY want?

jr@inset.UUCP (Jim R Oldroyd) (10/08/85)

One of the things I perceive from discussions on international
UNIX is that many people are merely thinking in terms of
enhancing existing software to solve one particular aspect
of a much larger multi-faceted problem.

I first asked the question: "What do we REALLY want?".  Ignore
issues of implementation for a moment - consider the situation
we find ourselves faced with.

It would be useful for me if I could not only edit a file
containing English text, but also intersperse at will text in
a different language.  Simple.  OK, but what if I suggest that
the other language is Arabic?  And then, I want to go on and
combine this with characters in different fonts and point sizes.

I also have a printer with a programmable character set.  I have
taken the trouble to ``design'' a character containing the
company logo, another with a pointed hand, bullets, pound signs
etc.  I look at them as ordinary single characters and I have
a need to manipulate files containing these characters using
my editor, grep, sort and so on.

I do not want to learn strange hieroglyphics like \(*p for the
Greek character pi, or the sequence \o'\(is_-' to get a British
Sterling symbol.

These are a few points.  To get what I want I will need
not only new utilities, but also special hardware such as a
bitmapped display terminal.

You can see that I need an extremely large and variable
character set.  It is not possible to construct that set out
of existing character sets, nor to expect a final set to
remain static.

I believe that the time is now ripe for the computer world
to take a jump from the traditional viewpoint and realize
that users' requirements in these days of networks and
typesetters are already far ahead of anything that an
enhanced character set can provide.

-- 
++ Jim R Oldroyd
++ jr@inset.UUCP
++ ..!mcvax!ukc!inset!jr

ncx@cheviot.uucp (Lindsay F. Marshall) (10/09/85)

In article <723@inset.UUCP> jr@inset.UUCP (Jim R Oldroyd) writes:
>
>I believe that the time is now ripe for the computer world
>to take a jump from the traditional viewpoint and realize
>that users' requirements in these days of networks and
>typesetters are already far ahead of anything that an
>enhanced character set can provide.
>

The mention of typesetters shows the way to go. Instead of a concept of
character set we need the printer's concept of 'font'*. Ok, that makes
programming a little difficult (!) as you have to know which font you're
in so as to understand the significance of the character, but it does
allow you to manage user defined symbols in a much cleaner way, and may
mean that you can get away without having to extend the number of bits
in a character so as to cope with every possible character set in the
world as your fonts need only contain the characters that you want them to.

------------------------------------------------------------------------------
Lindsay F. Marshall, Computing Lab., U of Newcastle upon Tyne, Tyne & Wear, UK
  ARPA  : lindsay%cheviot.newcastle.ac.uk@ucl-cs.arpa
  JANET : lindsay@uk.ac.newcastle.cheviot
  UUCP  : <UK>!ukc!cheviot!lindsay
-------------------------------------------------------------------------------

* fount n. A complete assortment of types of one sort, with all that is
necessary for printing that kind of letter. - Also (esp. USA) font.
[Fr. fonte - fondre - L. fundere, to cast]
	Chambers Twentieth Century Dictionary 1966 edition.

bill@inset.UUCP (Campbell) (10/09/85)

In article <723@inset.UUCP> jr@inset.UUCP (Jim R Oldroyd) writes:
>..... users' requirements in these days of networks and
>typesetters are already far ahead of anything that an
>enhanced character set can provide.

Here is just one example.

Almost all sites running UNIX, of any flavour, will have the
"standard" date command, plus the "standard" ctime, asctime routines in libc.
In other words, they are forced to use American date formats.

In the light of recent correspondence in net.general (USA != world), it will
come as no surprise that most citizens of the world find the inflexibility
of the date routines disappointing, or even offensive.

However, it is NOT going to be sufficient to provide fixed length translations
of the string portions of ctime output, to give, for example,

	Lun Sep 16 12:23:05 1985

There are two problems this does not address.  Firstly, one chooses a date
format to suit a particular purpose.  In some countries the layout of a 
date is required, for some legal purposes, to follow a fixed format, which is
very unlikely to be that given by ctime.  It may be that local requirements
mean the ctime output format has to be changed.  A quick survey shows that of
the 300-odd tools programs distributed with UNIX, 45 use ctime, and just over
half of those depend on the various words in the string being at *fixed*
locations.  Note that these are just the tools programs, no applications
were looked at.  There is going to be a porting problem here.

Secondly, rather a lot of people are based in areas which do not use the
24 hour GMT clock, let alone the Gregorian calendar.  Anyone got any ideas 
on what internationalisation could do (cheaply !) for them ?  Surely, in
MCMLXXXV it's not beyond the wit of man. :-)

stephen@dcl-cs.UUCP (Stephen J. Muir) (10/10/85)

In article <725@inset.UUCP> bill@inset.UUCP (Bill Fraser-Campbell) writes:
>Almost all sites running UNIX, of any flavour, will have the
>"standard" date command, plus the "standard" ctime, asctime routines in libc.
>In other words, they are forced to use American date formats.
>
>There are two problems this does not address.  Firstly, one chooses a date
>format to suit a particular purpose.  In some countries the layout of a 
>date is required, for some legal purposes, to follow a fixed format, which is
>very unlikely to be that given by ctime.  It may be that local requirements
>mean the ctime output format has to be changed.  A quick survey shows that of
>the 300-odd tools programs distributed with UNIX, 45 use ctime, and just over
>half of those depend on the various words in the string being at *fixed*
>locations.  Note that these are just the tools programs, no applications
>were looked at.  There is going to be a porting problem here.
>
>Secondly, rather a lot of people are based in areas which do not use the
>24 hour GMT clock, let alone the Gregorian calendar.  Anyone got any ideas 
>on what internationalisation could do (cheaply !) for them ?  Surely, in
>MCMLXXXV it's not beyond the wit of man. :-)

The answer is quite simple.  Put all the date conversion routines in the
Kernel code with system calls for user programs to fetch the date in string
form or whatever.  This way, local changes can be made to the kernel code to
accommodate variations, without having to recompile any programs.  Of course,
this kernel code would run in user mode (where possible) so as not to lock-out
other processes.

I stress that the kernel *must* still store the time internally in GMT.  This
is so that, e.g., tar tapes will have the correct time when taken to another
system.
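
A minimal user-level sketch of that principle (regardless of whether the
conversion ends up in the kernel or in a library, and leaning on a
strftime-style formatting call): the time is kept and transported as GMT
seconds, and only turned into a local, site-configurable string at the
moment of display.

    #include <stdio.h>
    #include <time.h>

    int main()
    {
        time_t stamp = time((time_t *)0);   /* stored/shipped as GMT      */
        struct tm *tm = localtime(&stamp);  /* localised only for display */
        char buf[64];

        /* The format string is the only locally variable part.           */
        strftime(buf, sizeof buf, "%d/%m/%Y %H:%M:%S", tm);
        printf("%s\n", buf);
        return 0;
    }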
-- 
UUCP:	...!seismo!mcvax!ukc!dcl-cs!stephen
DARPA:	stephen%lancs.comp@ucl-cs	| Post: University of Lancaster,
JANET:	stephen@uk.ac.lancs.comp	|	Department of Computing,
Phone:	+44 524 65201 Ext. 4599		|	Bailrigg, Lancaster, UK.
Project:Alvey ECLIPSE Distribution	|	LA1 4YR

crs@lanl.ARPA (10/10/85)

> I believe that the time is now ripe for the computer world
> to take a jump from the traditional viewpoint and realize
> that users' requirements in these days of networks and
> typesetters are already far ahead of anything that an
> enhanced character set can provide.

While we are at it, would it be asking too much for common sense and
the needs of touch typists to prevail in keyboard design?  The IBM-PC
keyboard is well known.  That of the VT-220 isn't much  better.  Who
designs these layouts?  Have they ever typed?

Item:  What is the one key that is used to enter *every* single line
of text?
			The return key!

Why, then, stick every off-the-wall key you can think of between the
home keys and the return key?  One key between the home keys and the
return key is acceptable; two is too many.  I happen to be typing this
on a VT-220, on which it is the vertical-bar/back-slash.  I don't
recall what it is on the IBM-PC but it is no more often used.  The
vertical-bar and the back-slash are fairly often used in Unix but
*not* as often as the return key.  Why not put it *outside* the return
key?  [Or where suggested at the end of the last item.]

Item:  The caps lock key is usually used infrequently, usually once
before typing at least a full word and often *only* once at the beginning
of a session to change to all caps for the *entire* session.  The control
key, on the other hand is used on a key by key basis.  That is to say,
the control key must be held down for *every* single control character
you want to type.  Why, then, put the lock key between the home keys
and the control key instead of the other way about (VT-220)?

Item:  The lock key on the VT-220 is a caps lock, not a *shift* lock.
Why, then, move the angle brackets to a separate key so that comma and
period are *both* unshifted and *shifted* versions of their respective
keys?  This is done on typewriter keyboards because *typewriters* have
*shift* lock, not caps lock.  On computer terminals, I can think of no
reason not to put the angle brackets at shift comma and shift period.
This would eliminate an unnecessary key.  [Perhaps this would have
been a good place to put vertical bar & back slash.]

I'm sure that the designer of this keyboard (and all the others)
thought that he or she had good reasons for using this layout.  I
happen to disagree.

Perhaps keyboard designers should all be required to learn touch
typing and then should be required to spend many hours typing on a
prototype of their creations before being allowed to select a final
design.
-- 
All opinions are mine alone...

Charlie Sorsby
...!{cmcl2,ihnp4,...}!lanl!crs
crs@lanl.arpa

lee@rochester.UUCP (Lee Moore) (10/10/85)

> I believe that the time is now ripe for the computer world
> to take a jump from the traditional viewpoint and realize
> that users' requirements in these days of networks and
> typesetters are already far ahead of anything that an
> enhanced character set can provide.
> 
> ++ Jim R Oldroyd

I think you may as well trash Unix then.  There are enough problems
with getting 8-bit characters, let alone a universal character set.  I think
the best you can hope for is an 8-bit ISO character set that will cover
Western Europe.

If you want to see an approach to universality that was done from scratch,
check-out the Xerox Star.  It essentially uses a 16-bit character set
that encodes many national character sets including all of Western Europe,
Greek, Russian and Japanese*.  Documents can contain any mix of languages.
Since Xerox won a Voice of America contract, they have been producing a new
alphabet a month.  Last month was Amharic, a language in Ethiopia.

lee

* side note on Japanese... you can't solve all of it.  Xerox is following
the standards produced by the JIS.
-- 
TCP/IP:		lee@rochester.arpa
UUCP:		{seismo, allegra, decvax, cmcl2, topaz, harvard}!rochester!lee
XNS:		Lee Moore:CS:Univ Rochester
Phone:		+1 (716) 275-7747, -5671
Physical:	43 01' 40'' N, 77 37' 49'' W

-- 11 months 'till I drop off the face of the earth.

roy@phri.UUCP (Roy Smith) (10/10/85)

Referring to the VT-220:

> I'm sure that the designer of this keyboard (and all the others)
> thought that he or she had good reasons for using this layout.  I
> happen to disagree.
>
> Charlie Sorsby, {cmcl2,ihnp4}!lanl!crs

	I agree with Charlie.  After an infinite variety of keyboard
layouts by various manufacturers, DEC had finally come up with a de-facto
standard with the VT-100 layout (also on the LA-120's, etc.)  So why change
it?

	The return key is off in right field somewhere, the business with
the "<>,." keys looses badly, and if you are running emacs, you have to go
hunting for the escape key.

	But then, what does this have to do with internationalization?
-- 
Roy Smith <allegra!phri!roy>
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016

mike@erix.UUCP (Mike Williams) (10/11/85)

In article <723@inset.UUCP> jr@inset.UUCP (Jim R Oldroyd) writes:
>
>
>I first asked the question: "What do we REALLY want?".  Ignore
>issues of implementation for a moment - consider the situation
>we find ourselves faced with.
>
It all depends on what we want to use UNIX for.  I use UNIX as a collection of
programming tools. However if I wanted to use UNIX for word processing, I
might buy a good word processing package. If that's all I wanted to do, I
could write a special shell for word processing. I think that these programs
themselves could deal with national character sets without disturbing standard
UNIX.

But really, I wouldn't do it that way at all. I would buy a Mac. Word
processor users don't want UNIX type interfaces. Mice and icons and such
like are so much better. Of course I could build these on top of UNIX, but
why bother when I can buy other systems like SmallTalk with all these things
anyway?

I suppose that all this rambling is just asking "Is there any point in an
international UNIX?". 

Mike Williams.

...mcvax!enea!erix!mike

[PS all these opinions are of course my own and do not represent those of my
employer, my wife, my child or my cat.]

tmb@talcott.UUCP (Thomas M. Breuel) (10/11/85)

In article <668@dcl-cs.UUCP>, stephen@dcl-cs.UUCP (Stephen J. Muir) writes:
> The answer is quite simple.  Put all the date conversion routines in the
> Kernel code with system calls for user programs to fetch the date in string
> form or whatever.  This way, local changes can be made to the kernel code to
> accommodate variations, without having to recompile any programs.  Of course,
> this kernel code would run in user mode (where possible) so as not to lock-out
> other processes.

You don't really mean to put date conversion routines into the *kernel*,
do you? The 4.2 kernel already is much too big, too unwieldy, and has
far too many system calls. I am sure it is the same for other modern
versions of UN*X (in fact I know it for certain).

The point at which country specific routines are combined with a program
is usually linking. This means that you don't have to re-compile, but
just to re-link your binaries if you want to generate them for a different
country. It seems to me that what you really want is run-time linking
for the functionality and shared locked libraries for the efficiency,
an addition to UN*X that is even justified on grounds other than
this national date format silliness.
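
One way to picture this, sketched below: every program is compiled
against a single fixed interface (the name nat_datestr and the French
format are invented for illustration), and the per-country object that
implements it is chosen when the binaries are linked, or, with shared
libraries, when they are run; the callers never need recompiling.

    /* natdate.h -- the one interface every program is compiled against  */
    extern char *nat_datestr(long clock);

    /* natdate_fr.c -- one of several per-country implementations; pick
     * the right object at link time, e.g.  cc -o prog prog.o natdate_fr.o */
    #include <stdio.h>
    #include <time.h>

    char *nat_datestr(long clock)
    {
        static char buf[32];
        time_t t = clock;
        struct tm *tm = localtime(&t);

        sprintf(buf, "%02d/%02d/%04d %02dh%02d",       /* jj/mm/aaaa */
                tm->tm_mday, tm->tm_mon + 1, tm->tm_year + 1900,
                tm->tm_hour, tm->tm_min);
        return buf;
    }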

If you want to see what generality in terms of character sets, dates,
string comparisons, &c does to an operating system, just look at the
M*cIntosh ROM. It is a mess, and not much is gained by it, since most
(American) programs don't use the facilities provided by the operating
system anyhow (i.e. string comparisons are, of course, done numerically).

						Thomas.

flaps@utcs.uucp (Alan J Rosenthal) (10/13/85)

In article <725@inset.UUCP> bill@inset.UUCP (Bill Fraser-Campbell) writes:
>However, it is NOT going to be sufficient to provide fixed length translations
>of the string portions of ctime output, to give, for example,
>
>	Lun Sep 16 12:23:05 1985
>
>There are two problems this does not address.  Firstly, one chooses a date
>format to suit a particular purpose.  In some countries the layout of a 
>date is required, for some legal purposes, to follow a fixed format, which is
>very unlikely to be that given by ctime.  It may be that local requirements
>mean the ctime output format has to be changed.  A quick survey shows that of
>the 300-odd tools programs distributed with UNIX, 45 use ctime, and just over
>half of those depend on the various words in the string being at *fixed*
>locations.  Note that these are just the tools programs, no applications
>were looked at.  There is going to be a porting problem here.

How about two routines, a ctime which returns american date format and a
locctime which returns local date format, or some such?  And also a routine
adatetolocdate (ugh) which converts a given date to local date format, whatever
that may be?  Then any software can use ctime to pick out bits of the
date by fixed location, still remaining portable, but date(1) and anything
else can use locctime.  Furthermore old software which was unfriendly (i.e.
uses only the American date format) would still work, and probably be easily
patched, perhaps with adatetolocdate.  In fact, programs like readnews could
convert to local date with this, even though the news article only contains
American-style date.
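
A rough sketch of the interfaces being suggested (locctime and
adatetolocdate are hypothetical names from this proposal, not existing
library calls): locctime formats directly in the local convention, and
adatetolocdate rewrites an American ctime()-style string after the fact.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Hypothetical: like ctime(), but in the site's local convention.   */
    char *locctime(time_t *clock)
    {
        static char buf[32];
        struct tm *tm = localtime(clock);

        sprintf(buf, "%02d.%02d.%04d %02d:%02d:%02d",
                tm->tm_mday, tm->tm_mon + 1, tm->tm_year + 1900,
                tm->tm_hour, tm->tm_min, tm->tm_sec);
        return buf;
    }

    /* Hypothetical: rewrite an American ctime()-style string (e.g. the
     * one in a news article) into the local day.month.year form.        */
    char *adatetolocdate(char *adate)
    {
        static char months[] = "JanFebMarAprMayJunJulAugSepOctNovDec";
        static char buf[16];
        char mon[4], *p;
        int mnum;

        strncpy(mon, adate + 4, 3);  mon[3] = '\0';
        p = strstr(months, mon);
        mnum = p ? (int)(p - months) / 3 + 1 : 0;
        sprintf(buf, "%02d.%02d.%04d",
                atoi(adate + 8), mnum, atoi(adate + 20));
        return buf;
    }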

I might be missing something important, not being very familiar with this
kind of stuff, but it seems like a good idea which would not cause transition
problems.

Alan J Rosenthal					decvax!utzoo!utcs!flaps
--
Note: I am not employed by University of Toronto Computer Science Department
      or Computer Services, or anything else that would come to mind.

inc@fluke.UUCP (Gary Benson) (10/15/85)

> Would it be asking too much for common sense and
> the needs of touch typists to prevail in keyboard design?
>
> ... [ many examples ] ...
>
> I'm sure that the designer of this keyboard (and all the others)
> thought that he or she had good reasons for using this layout.  I
> happen to disagree.
> 
> Perhaps keyboard designers should all be required to learn touch
> typing and then should be required to spend many hours typing on a
> prototype of their creations before being allowed to select a final
> design.
>
> Charlie Sorsby


Hear, hear! Those who design *anything* should be required to use it! This
is particularly true of human interface items such as keyboards, displays,
and "error" messages. A few weeks ago, I was complaining to a programmer
about how cumbersome the thing was to use, and was told, "Work with it a
while- you'll get used to it."

Well I'm getting sick and tired of having to "get used to it". As soon as
someone tells me that, a little voice inside me says, "Uh oh. Another
schlock job". Keyboards and where you put the different keys are perfect
candidates for ergonomists, but somehow the old attitude prevails that says,
"The qwerty keyboard is too universally familiar to change". Horse pucky.
And horse pucky to every "designer" who never even talks to a person who
will be using his product.




-- 
 Gary Benson  *  John Fluke Mfg. Co.  *  PO Box C9090  *  Everett WA  *  98206
   MS/232-E  = =   {allegra} {uw-beaver} !fluke!inc   = =   (206)356-5367
 _-_-_-_-_-_-_-_-ascii is our god and unix is his profit-_-_-_-_-_-_-_-_-_-_-_ 

rcd@opus.UUCP (Dick Dunn) (10/16/85)

> >I believe that the time is now ripe for the computer world
> >to take a jump from the traditional viewpoint and realize
> >that users' requirements in these days of networks and
> >typesetters are already far ahead of anything that an
> >enhanced character set can provide.
>...
> The mention of typesetters shows the way to go. Instead of a concept of
> character set we need the printer's concept of 'font'*...

It's true that we need to understand the world of typesetting, and also
that it gives us some clues about how to proceed from where we are today,
but be careful--the concepts of `font' and `character set' are two entirely
different ideas.  To specify printed material, you specify (among other
things) the characters to be printed and the font to be used in printing
them.  The abstraction `character' is meaningful quite independent of the
font used to represent characters.  For example, spelling and collating are
done without regard to font.  Consider ligatures in the sense they are used
in typesetting English if you need to sort out your ideas about characters
and fonts.  A character is some magic abstraction of an atomic entity at
some level.  A font provides a specific set of physical realizations
(concrete notations) for certain characters.

HOWEVER, the printer's concept of font (or fount, over there) illustrates
the peril of considering character set as a simple, immutable concept.  One
ordinary font might have 150 or so characters.  The total number of
characters possible is in the thousands (at least?).  What do we use for a
character set?  If we choose some small (<200) set of common characters,
how do we represent the rest?  If we attempt to choose some large (>>1000)
set of characters, we come up with the questionable idea that 90%+ of our
characters will not be representable in any given font we pick!  Some sort
of hybrid approach seems necessary somehow. (waffle waffle)  And we certainly
don't want to have to deal with more than one representation of a given
character (e.g., for different fonts) when we are only interested in the
underlying information rather than the presentation.  (The "content" vs.
"presentation" distinction is a key one!)
-- 
Dick Dunn	{hao,ucbvax,allegra}!nbires!rcd		(303)444-5710 x3086
   ...Simpler is better.

peter@graffiti.UUCP (Peter da Silva) (10/16/85)

> But really, I wouldn't do it that way at all. I would buy a Mac. Word
> processor users don't want UNIX type interfaces. Mice and icons and such
> like are so much better. Of course I could build these on top of UNIX, but
> why bother when I can buy other systems like SmallTalk with all these things
> anyway?

Because you can't get SmallTalk (if you ever considered UNIX to be a resource
hog, have a look at SmallTalk some time), and the Mac user interface is
running on a horrid CP/M-like operating system. UNIX is an excellent base
to build all sorts of special-purpose systems: it's small & fast, expert
friendly (remember, someone has to write the user friendly interface), and
widely available.

seifert@hammer.UUCP (Snoopy) (10/16/85)

In article <960@erix.UUCP> mike@erix.UUCP (Mike Williams) writes:

>But really, I wouldn't do it that way at all. I would buy a Mac. Word
>processor users don't want UNIX type interfaces. Mice and icons and such
>like are so much better.

That's one opinion.  I wouldn't buy a Mac.  I find the mouse and those
cutsey-pooh icons a pain to use.  Some people find the Mac-style
interface easier to use, some find the Unix-style interface easier
to use.  There's room for both.

What was the first thing Unix was used for? TEXT PROCESSING! There
are all sorts of nice utilities for dealing with text.

One thing I haven't seen that would be nice is to take emacs, add the
functionality of ditroff, and use a bit-mapped display that shows a
full page of text, just as it will come out of the laser printer.
Does something like this exist?

As far as character sets go, it would seem that 16 bits (65536
possible characters) should be more than enough.  About 9000
for Chinese, and 7000 for Japanese, plus all the European
languages, some math and other symbols, and there should be
room left over for some simple graphics characters.  In fact,
15 bits should be enough, leaving one bit for parity or flagging.

Snoopy
tektronix!tekecs!doghouse.TEK!snoopy

peter@graffiti.UUCP (Peter da Silva) (10/19/85)

> > Perhaps keyboard designers should all be required to learn touch
> > typing and then should be required to spend many hours typing on a
> > prototype of their creations before being allowed to select a final
> > design.
> >
> > Charlie Sorsby
> 
> 
> Hear, hear! Those who design *anything* should be required to use it! This...

...isn't always possible.

While I'm often amazed at the junk that other programmers produce & call
finished products, I'm embarrased at some of the stuff that I've been forced
to release before it's really ready & debugged: I always try to spend at least
a couple of days *after* I'm satisfied just looking for the limits. All too
often I've been pulled from a project before I've had a chance to do this.

Keyboards, now. Let's not start that up again. I'm pretty sure everyone's
already been through the "why didn't IBM put a Selectric-style keyboard
on the IBM-PC, and why is everyone following that schlock design?" debate
too many times...

henry@utzoo.UUCP (Henry Spencer) (10/20/85)

> As far as character sets go, it would seem that 16 bits (65536
> possible characters) should be more than enough.  About 9000
> for Chinese, and 7000 for Japanese, plus all the European
> languages, some math and other symbols, and there should be
> room left over for some simple graphics characters...

The trouble with this (and the other similar proposals) is it asks the
Western world to pay a factor of 2 in storage overhead for the sake of
the Asian character sets.  This will never sell.  Most of the sites that
would be affected will never want to store *anything* written in Japanese
or Chinese.  Why should they pay double the storage price (and bandwidth
price) for the ability to do so?

The only reason that the new 8-bit ISO standard isn't going to cause
major disruption (except in a few sloppy Unix programs) is that the
8th bit is already there, and largely unused, in existing machines.

Solving the problems of the Asian languages is a laudable goal, but I
am not convinced that we know how to do it effectively.  The new ISO set
will be an important step towards solving the problems of the Western
languages, and this may be all we can realistically hope for in the short
term.  "The best is the enemy of the good."
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

seifert@hammer.UUCP (Snoopy) (10/23/85)

In article <6066@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>> As far as character sets go, it would seem that 16 bits (65536
>> possible characters) should be more than enough.  About 9000
>> for Chinese, and 7000 for Japanese, plus all the European
>> languages, some math and other symbols, and there should be
>> room left over for some simple graphics characters...
>
>The trouble with this (and the other similar proposals) is it asks the
>Western world to pay a factor of 2 in storage overhead for the sake of
>the Asian character sets.  This will never sell.  Most of the sites that
>would be affected will never want to store *anything* written in Japanese
>or Chinese.  Why should they pay double the storage price (and bandwidth
>price) for the ability to do so?

I don't like having to use up more bits per character either, but I
can't see any way around it.  Non-fixed length characters would be
a real mess, as someone pointed out.  (Previous to his/her article
I had been thinking that this would be a great idea.)  For some
applications, data compaction could be done.

>The only reason that the new 8-bit ISO standard isn't going to cause
>major disruption (except in a few sloppy Unix programs) is that the
>8th bit is already there, and largely unused, in existing machines.

Better than nothing, but it doesn't solve the whole problem.  How
many times do you want to change the standard?  I'd like to see it
changed once, correctly, and be done with it.

The problem is that changing the standard character set is going to be
a really major change.  In addition to changing software, there's
all those terminals that need to change.

Whatever we do, there is going to be a LONG period of time when
we have to deal with *both* standards.  Which will likely be a mess.
We have to introduce the new without blowing away the old.

Maybe the old standard won't go away at all.  We seem to survive
with both ASCII and EBCDIC.  With Beta and VHS, with metric
and SAE hardware, etc. etc.

Also note that memory/disk costs are dropping.  Sixteen bit chars
are not as outrageous sounding as they were a few years ago.
(I know, I know, they'll never be as cheap as we'd like.)

We definitely need to make an improvement.  What we have now
is not good enough. A change, any change, is going to be
painful.  We have a chance to do it right.  Let's go for it!

--------------------------
Regarding this newsgroup, apparently the new policy is that
any group gets killed unless created with the permission of
the US net-lords.  It appears that Europe isn't allowed to
create groups based on the consensus of a conference.  -sigh-

Whoever's counting, add one vote for net.international, or
net.unix.international, or whatever.  No, *don't* create
a Europe-only group, that's taking a step backwards.
--------------------------
Snoopy, waiting for the day I'm forced to buy a Chinese-English
	dictionary to read my e-mail.
(ihnp4 | decvax | allegra | ???) !tektronix!tekecs!doghouse.TEK!snoopy

jr@inset.UUCP (Jim R Oldroyd) (10/23/85)

In article <960@erix.UUCP> mike@erix.UUCP (Mike Williams) writes:
>It all depends on what we want to use UNIX for.  I use UNIX as a collection of
>programming tools. However if I wanted to use UNIX for word processing, I
>might buy a good word processing package.

This is EXACTLY the point I was making.  Mike goes on to say that
it is up to each piece of software to interpret the internal codes
in whatever way suits it best.

But what happens if one wishes to use the same input files for
different applications?

-- 
++ Jim R Oldroyd
++ jr@inset.UUCP
++ ..!mcvax!ukc!inset!jr

dave@enmasse.UUCP (Dave Brownell) (10/25/85)

In article <311@graffiti.UUCP> peter@graffiti.UUCP (Peter da Silva) writes:
>> ... Of course I could build these on top of UNIX, but
>> why bother when I can buy other systems like SmallTalk with all these things
>> anyway?
>
> Because you can't get SmallTalk (if you ever considered UNIX to be a resource
> hog, have a look at SmallTalk some time), and the Mac user interface is
> running on a horrid CP/M-like operating system.

You CAN TOO get smalltalk on a Mac !!!  I've seen it and it actually
looks good (thanks Mark!) on a 1 Mb Mac with a hard disk.

Though you're right about SmallTalk needing memory -- I wouldn't want to
run it on a 512K Mac, or without a hard disk.  But it IS available, and
from Apple at that.  (See discussions on INFO-MAC, and the next SMUG meeting.)
-- 
David Brownell
EnMasse Computer Corp
...!{harvard,talcott,genrad}!enmasse!dave

gnu@l5.uucp (John Gilmore) (10/28/85)

In article <6066@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
> > As far as character sets go, it would seem that 16 bits (65536
> > possible characters) should be more than enough...
> 
> The trouble with this (and the other similar proposals) is it asks the
> Western world to pay a factor of 2 in storage overhead for the sake of
> the Asian character sets.

I think the proposals are that a coding scheme for text be defined which
allows 16-bit characters to be escape-coded into an 8-bit text stream.
The arguments mostly center on what kind of coding scheme would fit both
the needs of few-16-bit-char folks and few-8-bit-char folks without wasting
too much storage for either.

Internally to an international program, characters would be 16 bits,
but stdio routines (printw, fprintw, sscanw, etc) would encode to a
bytestream on the way in and out.  ("w" for "world" or "wide").

(Hmm, the non-Unix-opsys people have been looking for a way to tell when
we Unixoids are reading or writing a text file versus a binary file...now
that we propose encoding our own text files, they will have the clue.)

kimcm@diku.UUCP (Kim Christian Madsen) (10/28/85)

In article <1581@hammer.UUCP> tekecs!doghouse.TEK!snoopy writes:
>Problem is, that changing the standard character set is going to be
>a really major change.  In addition to changing software, there's
>all those terminals that need to change.

Can you imagine a keyboard with 65535 different characters available,
*WOUW* (-; 

Well, I work with keyboard layouts with approx. 200 different visible
characters available, by pressing ctrl, alt and certain dead keys
(like accent keys) to obtain the many characters within a sensible
keyboard size. The major obstacle to bypass is not the terminals (you
can just build the character proms big enough!) but the keyboards.
If you have all the fancy characters at hand, you bet some will want
to use them. Having each key on the keyboard represent more than
4 different characters is too frustrating (I have enough trouble
finding the correct character with only *FOUR* different characters
for each key!!!)  And I would certainly not be too satisfied with a
keyboard which fills all of my desk.

Yes, there certainly is a need for an International Standard of
lettering.  I would like to be able to use the correct way of addressing
a person in another country with another alphabet, whether it's
Japanese, French, Danish or whatever. But I see no easy way of doing
this (if the human interface is going to be friendly).

Maybe we shall have to wait for the computer which understands human
speech, and then translates the spoken word into the proper characters!

However, some advance is still possible: if we restrict the characters
to those built upon the LATIN characters (ABCDE... etc.) we can do it
with easy-to-remember keys, like Olivetti has done with their M24,
where you can hit the (dead) key ' and then an e and get an e with an
accent aigu. This can be done with all the accents and the like
( ' ` ^ " ~ u v o ) and thereby increase the number of characters a
great deal! We might be able to create a full European character set
including the characters used in Eastern Europe.
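
A minimal sketch of the dead-key idea in C (the table is illustrative,
not any vendor's actual layout, and the result codes are simply made-up
slots in an extended Latin set): the accent key produces nothing by
itself and is combined with the following keystroke through a small
lookup table.

    /* Combine a dead accent key with the letter typed after it.  The
     * result codes are invented placeholders for an 8-bit Latin set.  */
    struct compose { char accent, base; unsigned char result; };

    static struct compose table[] = {
        { '\'', 'e', 0xE9 },        /* e acute        */
        { '`',  'e', 0xE8 },        /* e grave        */
        { '^',  'o', 0xF4 },        /* o circumflex   */
        { '"',  'u', 0xFC },        /* u umlaut       */
        { 0,    0,   0    }
    };

    int compose(int accent, int base)
    {
        struct compose *p;

        for (p = table; p->accent; p++)
            if (p->accent == accent && p->base == base)
                return p->result;
        return base;                /* no composition: plain letter */
    }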

However, to use LATIN, Japanese, Chinese, Arabic, Hebrew and other
character types simultaneously on the same keyboard isn't going to
work well.

						Kim Chr. Madsen
						kimcm@diku.uucp

robert@erix.UUCP (Robert Virding) (10/30/85)

In article <224@l5.uucp> gnu@l5.uucp (John Gilmore) writes:
>In article <6066@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
>> > As far as character sets go, it would seem that 16 bits (65536
>> > possible characters) should be more than enough...
>> 
>> The trouble with this (and the other similar proposals) is it asks the
>> Western world to pay a factor of 2 in storage overhead for the sake of
>> the Asian character sets.

The way I see it is that *everyone* will have to pay a price to allow
"foreign" character sets.  And anyway there are more Asians than
people of European origin, so it seems only just. :-)

>I think the proposals are that a coding scheme for text be defined which
>allows 16-bit characters to be escape-coded into an 8-bit text stream.
>The arguments mostly center on what kind of coding scheme would fit both
>the needs of few-16-bit-char folks and few-8-bit-char folks without wasting
>too much storage for either.

Wow, this sounds like trying to convert ITS-Emacs' 9-bit ascii into
7-bit sequences, but 7 bits worse.  Talk about breaking existing
programs.  And who is to say that the *English* alphabet should be in
the 8-bit set?

>Internally to an international program, characters would be 16 bits,
>but stdio routines (printw, fprintw, sscanw, etc) would encode to a
>bytestream on the way in and out.  ("w" for "world" or "wide").

Does this mean there will be two basically different types of programs
that handle text?  And will these two worlds be able to communicate
with each other through text files?  This sounds a little like "we
have to accept that the rest of the world may like to use their own
language, but <strong expletive> if we English speakers are going to
have to change anything for their sakes".

			Robert Virding  @ L M Ericsson, Stockholm
			UUCP: {decvax,philabs,seismo}!mcvax!enea!erix!robert

peter@graffiti.UUCP (Peter da Silva) (10/31/85)

> However, to use LATIN, Japanese, Chinese, Arabic, Hebrew and other
> character types simultaneously on the same keyboard isn't going to
> work well.

What you need for the multiple-languages problem is a dynamic keyboard. For
instance, you could use an LCD touch-screen for the keyboard & display the
currently-selected character set... this would also solve the problem of
switching between Qwerty, Dvorak, and some of the more exotic layouts that
have been proposed.

And for the problem of the large number of glyphs in oriental character sets,
I believe there are already systems that address this problem. They provide
a katakana keyboard, and after entering a word a selection of kanji characters
is displayed for the operator to select the correct one from. (I hope I have
the correct terms here).

Combining the two ideas, you could have the *keyboard* itself change to one
containing the possible kanji for a given word after entering each word.

I know there are UNIX sites in Japan. Are there any on this net?
-- 
Name: Peter da Silva
Graphic: `-_-'
UUCP: ...!shell!{graffiti,baylor}!peter
IAEF: ...!kitty!baylor!peter

seifert@hammer.UUCP (Snoopy) (11/03/85)

In article <18@diku.UUCP> kimcm@diku.UUCP (Kim Christian Madsen) writes:

> Can you imagine a keyboard with 65535 different characters available,
> *WOUW* (-; 

I'd rather not, actually.  One possibility is that the terminal can
*display* any character, but the keyboard remains a reasonable size.
There could be an optional second keyboard with additional characters.

> Maybe we shall have to wait for the computer which understands human
> speech, and then translates the spoken word into the proper characters!

This is bound to be unsuitable for many environments.  Would be nice to
have around when it *was* useable, though!  Might be especially handy
for portable computers, where the keyboard is already the limiting
factor for compactness.

> However, to use LATIN, Japanese, Chinese, Arabic, Hebrew and other
> character types simultaneously on the same keyboard isn't going to
> work well.

I'd like to see a 'standard western keyboard' that had all the
characters for English, German, French, Swedish, etc., plus
Greek for math/engr and for APL, a few of the common math symbols,
and of course copyright and trademark symbols for all the net lawyers.
:-)  A few simple graphics characters would be nice.  This is
possible on a keyboard of reasonable size.  Soft keys would allow
easy access to a 'few' other characters.  (The keyboard I'm using now
has 24 (!) extra soft keys.  Wish I could load them up with alphas and
umlauts and such.) These can be downloaded from the host easily enough.
If you need another language like Hebrew or Chinese, plug in a second
keyboard.  (Presumably in, say, China, the Chinese keyboard would be the
primary, and the western keyboard the optional one.)

Is that better?

Auf Wiedersehen,
Snoopy (ECS RONIN #901)
tektronix!tekecs!doghouse.TEK!snoopy

donn@hpfcla.UUCP (11/04/85)

I'm not sure about ANSI, but both ISO and JIS have standards for character
font selection.  (So does GOST (that's Russia, for those who care), I 
believe.)  Before carrying any discussion further on the issue of character
sets, it's probably a good idea to do it in the context of existing standards.

These standards do NOT solve all (or anything like all) of the problems,
but any proposal inconsistent with them is doomed to fail due to government
standards (usually NOT in the US) endorsing the above standards.

In particular the ESC character is used in conjunction with SI and SO for
a lot of mixed font data.

I don't have copies of the relevant standards handy, and I'm not enough of
an expert to talk sensibly about the technical issues, but pragmatic 
reality says that these standards have to be considered.

Donn Terry
HP Ft. Collins
ihnp4!hpfcla!donn    (303)226-3800 x2367

P.S.  Honeywell (Arizona??) used to print a multi-colored chart of all the
character set standards current at the time.  It included ASCII/ISO, JIS,
GOST, and (gasp) EBCDIC (at least).  It summarized all the exceptions, national 
conventions, and had citations to the relevant standards.  Does anyone know if
they've kept it up, or if there is an equivalent I could get?

franka@mmintl.UUCP (Frank Adams) (11/04/85)

In article <988@erix.UUCP> robert@erix.UUCP (Robert Virding) writes:
>In article <224@l5.uucp> gnu@l5.uucp (John Gilmore) writes:
>>I think the proposals are that a coding scheme for text be defined which
>>allows 16-bit characters to be escape-coded into an 8-bit text stream.
>>The arguments mostly center on what kind of coding scheme would fit both
>>the needs of few-16-bit-char folks and few-8-bit-char folks without wasting
>>too much storage for either.
>
>Wow, this sounds like trying to convert ITS-Emacs' 9-bit ascii into
>7-bit sequences, but 7 bits worse.  Talk about breaking existing
>programs.  And who is to say that the *English* alphabet should be in
>the 8-bit set?

I think you miss the point here.  Certainly the 8-bit code should support
the basic Roman alphabet and reasonable extensions to it.  This will cover
all the European languages except Greek and those using the Cyrillic
alphabet.  (What to do about those, as well as Arabic, is not obvious.)
What is not included is the Japanese and Chinese ideographs, which do not
fit in an 8 bit code just by themselves.  Doubling the size of all text files
is just not a viable option.

Let me make a more concrete proposal for a standard (although still pretty
vague).  One needs an escape character from an 8-bit ASCII code.  The obvious
choice for this is decimal 255 (hex FF).  Following the escape byte would
be a byte identifying the function.  Functions include:

* The following two bytes are a 16-bit character.

* Change into 16-bit mode.

* Specify the alphabet to be used for subsequent characters (e.g., Greek,
Cyrillic, Arabic, etc.)

The same two byte sequences can be used as escapes from the 16 bit mode.
Thus, if 01 is the function code for the Roman alphabet, the 16 bit
"character" FF01 would mean "drop into 8 bit mode, using the Roman alphabet".

This would mean two bytes of overhead per file for documents using a
different alphabet.  I do not think this is an unacceptable overhead.
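
A minimal decoder sketch for this scheme (the function-code values other
than 01 are invented here, since the proposal leaves them unassigned):
it turns the byte stream into 16-bit character values, handling both the
single-character escape and the mode switches.

    #include <stdio.h>

    #define ESC   0xFF      /* the escape byte from the proposal            */
    #define F_ROM 0x01      /* Roman alphabet / drop back to 8-bit mode     */
    #define F_CHR 0x02      /* invented: next two bytes are one 16-bit char */
    #define F_16  0x03      /* invented: switch into 16-bit mode            */

    /* Return the next 16-bit character from fp, or -1 at end of file.     */
    long getchar16(FILE *fp)
    {
        static int wide = 0;        /* currently in 16-bit mode?            */
        int c, f, hi, lo;

        for (;;) {
            if (wide) {
                if ((hi = getc(fp)) == EOF || (lo = getc(fp)) == EOF)
                    return -1;
                if (hi == ESC && lo == F_ROM) { wide = 0; continue; }
                return ((long)hi << 8) | lo;
            }
            if ((c = getc(fp)) == EOF)
                return -1;
            if (c != ESC)
                return (long)c;             /* ordinary 8-bit character     */
            if ((f = getc(fp)) == EOF)
                return -1;
            if (f == F_CHR) {
                if ((hi = getc(fp)) == EOF || (lo = getc(fp)) == EOF)
                    return -1;
                return ((long)hi << 8) | lo;
            }
            if (f == F_16) { wide = 1; continue; }
            /* other function codes (alphabet selection etc.) ignored here  */
        }
    }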

Now, this would leave the default to be the Roman alphabet.  This is de facto
discriminatory, but the reasons for it are not.  The cost of converting to
a non-upward-compatible format is large.  (The cost of converting to an
upward-compatible format is large enough that it will be a problem.)

>This sounds a little like "we
>have to accept that the rest of the world may like to use their own
>language, but <strong expletive> if we English speakers are going to
>have to change anything for their sakes".

Yeah, it does sound a bit like that.  And there are people who feel that way.
But there are also good economic reasons for finding an upward compatible
solution.  And regardless of the reasons, if you don't make it easy for the
English speakers to adopt the standard, they won't, and the effort will fail,
or at best be much less successful than it could have been for many years.
I think success in this endeavor is much more important than keeping to any
absolute standards of fairness.  (Absolute is a key word in that sentence.
Some minimum of fairness is what this is all about.)

Frank Adams                           ihpn4!philabs!pwa-b!mmintl!franka
Multimate International    52 Oakland Ave North    E. Hartford, CT 06108

roy@phri.UUCP (Roy Smith) (11/06/85)

> One needs an escape character from an 8-bit ASCII code. [...] Following
> the escape byte would be a byte identifying the function.  Functions
> include: [...] Specify the alphabet to be used for subsequent characters
> (e.g., Greek, Cyrillic, Arabic, etc.)

	Just to play devil's advocate for a minute, let's say you have a
file in greek, with the first couple of bytes being the "locking shift to
greek" function.  Guess what breaks:

	Tail -- you can't get the last 10 lines of a file if you don't read
the whole file and track the shift commands.

	Grep -- you're looking for all lines containing pi-iota-gamma;
should grep track the shift commands and surround each output line by
"locking shift to greek" and "back to English"?  If you do it that way, and
run "grep ^ < greek1 > greek2", the greek[12] files will not cmp the same
because the second will have lots of extraneous shift commands.  Do you now
need a shift-optimizing filter to put files into canonical form?

	I'm sure there are more examples, but you get the idea.
-- 
Roy Smith <allegra!phri!roy>
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016

craig@dcl-cs.UUCP (Craig Wylie) (11/06/85)

In article <18@diku.UUCP> kimcm@diku.UUCP (Kim Christian Madsen) writes:

> Can you imagine a keyboard with 65535 different characters available,
> *WOUW* (-; 

The main problem here is obviously one of size (Awards for statement
of the bleeding obvious need not be presented). People seem to want a keyboard
that will not only display the required characters on the tops of the keys
but they also want it general enough to handle n different character sets. 
Note that as well as additional characters on a French keyboard
compared to a British one, even some of the standard characters are in 
different places.

What we need is a re-configurable keyboard. Of usable size but with all 
needed characters displayed, as they appear on the screen, on the tops
of the keys. OK. Implant an LCD matrix on top of each key and display the
character that it represents. If a large enough matrix was available then
surely most characters could be displayed. If a set contains more characters
than available keys then one key changes between 'pages' of characters. This
needs more thought, as a language with 5000 characters would have far too many
'pages' to be useful, but heuristics could probably be devised to help.

The internal representation  of character sets is a problem to be resolved at
another time (stay tuned).



			Craig.

mikeb@inset.UUCP (Mike Banahan) (11/07/85)

It is worth noting that to provide support for languages with a very large
repertoire (``characters'') such as Chinese, it is not common practice to
use a particularly large keyboard. The technique normally employed for
data entry in such languages is different.

Typically it is done by entering a phonetic equivalent of the word that is
wanted, using a small number of characters: a phonetic notation for Chinese,
using roman characters, is already well established. The terminal has enough
intelligence to search its dictionary of characters and to display several
alternatives in the large character set which correspond more or less closely
to the phonetic input. The user selects the one wanted and carries on.
This sounds slow, but as far as I remember it is recognised as being
one of the quickest ways of actually inputting ideograms.
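
Something like the following is presumably what the terminal's firmware
does (the dictionary and the ideogram codes are invented for
illustration): look the phonetic string up, show the candidates, and let
the user pick one.

    #include <stdio.h>
    #include <string.h>

    /* Toy dictionary: phonetic spelling -> candidate 16-bit ideogram
     * codes.  A real one holds thousands of entries.                    */
    struct entry { char *phonetic; unsigned short cand[4]; int n; };

    static struct entry dict[] = {
        { "ma",   { 0x3021, 0x3022, 0x3023 }, 3 },
        { "shan", { 0x3B33 },                 1 },
    };

    /* Look up a phonetic string and let the user choose a candidate.    */
    unsigned short pick(char *phonetic)
    {
        int i, choice;

        for (i = 0; i < (int)(sizeof dict / sizeof dict[0]); i++) {
            if (strcmp(dict[i].phonetic, phonetic) != 0)
                continue;
            for (choice = 0; choice < dict[i].n; choice++)
                printf("%d: code %04x\n", choice + 1, dict[i].cand[choice]);
            printf("which one? ");
            if (scanf("%d", &choice) != 1 || choice < 1 || choice > dict[i].n)
                choice = 1;
            return dict[i].cand[choice - 1];
        }
        return 0;                   /* not in the dictionary */
    }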

Anyhow, the upshot is that you can input Chinese using standard keyboards.
The terminal display and intelligence has to be upgraded considerably, but
then that is pretty simple nowadays. The terminal I'm using now isn't that
much less intelligent than our Vax (and a lot less overloaded!). Forget all
those pictures in the silly papers of Chinese typewriters with a keyboard the
size of a table.

-- 
Mike Banahan, Technical Director, The Instruction Set Ltd.
mcvax!ukc!inset!mikeb

greg@hwcs.UUCP (Greg Michaelson) (11/07/85)

> In article <18@diku.UUCP> kimcm@diku.UUCP (Kim Christian Madsen) writes:
> 
> > Can you imagine a keyboard with 65535 different characters available,
> > *WOUW* (-; 
> 
> I'd rather not, actually.  One possibility is that the terminal can
> *display* any character, but the keyboard remains a reasonable size.
> There could be an optional second keyboard with additional characters.

There's a research project ( Southampton Uni ???) which is putting
light matrix displays into keys to show the characters the keys currently
activate. When a new character set is programmed the key displays get updated.
With a character set menu it might be possible to pull
down huge character sets in manageable chunks. Maybe an intelligent system
could learn/predict when different characters are used in different contexts?

mats@fortune.UUCP (Mats Wichmann) (11/08/85)

Okay, I don't know if anyone has posted this; we seem to be getting
things very sporadically here, so I may have missed it.  However:
There is an ISO standard for "code extension techniques" (ISO 2022) 
which is supposed to address these wonderful issues.  It starts 
from 7-bit ASCII (very important, because they use the 8th bit...). 
There are two ways to shift character sets: "Single-shift" and 
"Locking-Shift". Single shift is like you pressing the SHIFT or 
CONTROL key on your terminal - it has to be done for each character.
Locking Shift puts you into a different mode until an unlock sequence 
comes along.

The AT&T internationalization proposal is based on this idea,
but uses only single-shift, and basically follows these two rules:

1. If the high-order bit of an 8-bit byte is turned off, the 8-bit
   sequence comes from an ASCII character set.

2. If the high-order bit is turned on, the 8-bit sequence is non-ASCII
   and should be interpreted as belonging to one of the three local
   character sets. The exact character set it belongs to depends
   on the internal coding method and whether it was preceded by
   a single-shift character.

There will be special "single-shift characters" which signify
one or two byte following sequences (the two magic cookies
which select this would be "SS2" = 0x8e and "SS3" = 0x8f).
The above is a major condensation, and only represents the
proposal as I understand it.
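
As I read it, classifying the bytes of such a stream would look roughly
like the sketch below (my own condensation again, so treat the details
as an assumption; only one byte is read after a shift here, though the
proposal also allows two-byte sequences).

    #include <stdio.h>

    #define SS2 0x8e        /* single shift 2: next byte from set 2      */
    #define SS3 0x8f        /* single shift 3: next byte from set 3      */

    /* Classify the next character of an 8-bit stream coded this way.
     * Returns -1 at EOF, 0 for ASCII, 1 for the default non-ASCII set,
     * 2 or 3 for a single-shifted set; *code gets the character value.  */
    int nextchar(FILE *fp, int *code)
    {
        int c, d;

        if ((c = getc(fp)) == EOF)
            return -1;
        if ((c & 0x80) == 0) {          /* rule 1: high bit off -> ASCII */
            *code = c;
            return 0;
        }
        if (c == SS2 || c == SS3) {     /* single shift before this char */
            if ((d = getc(fp)) == EOF)
                return -1;
            *code = d & 0x7f;
            return c == SS2 ? 2 : 3;
        }
        *code = c & 0x7f;               /* rule 2: high bit on, no shift */
        return 1;
    }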

The reference document is: "Information Processing - ISO 7-bit
and 8-bit Coded Character Sets - Code Extension Techniques",
ISO 2022-1982(E).

I am relatively new to this game, so if anyone has sensible
objections to this scheme, I would love to be educated.

This sort of suggestion does not, of course, tackle issues
like sorting at all; it merely suggests how to represent
the data, not what you can do with it.

Mats Wichmann
Fortune Systems
{ihnp4,hplabs,dual}!fortune!mats

peter@graffiti.UUCP (Peter da Silva) (11/10/85)

> > One needs an escape character from an 8-bit ASCII code. [...] Following
> > the escape byte would be a byte identifying the function.  Functions
> > include: [...] Specify the alphabet to be used for subsequent characters
> > (e.g., Greek, Cyrillic, Arabic, etc.)

Ascii is really a 7-bit code. Thus if the 8th bit is set then this byte and
perhaps the next can be considered escaped info. I don't believe that locking
shifts are a good idea, though, since it makes it hard to take an arbitrary
lump of text and tell what it means. Since it's been established that there is
no way of implementing a general foreign-language sort without table look-up
and perhaps more involved heuristics (to handle Dutch "ij", for example)
anyway, why not do something like this...

	0xxxxxxx	  Normal ASCII
	10xxxxxx	  Foreign ROMAN characters
	11xxxxxx xxxxxxxx Kanji or other extended character

...and just stuff all the foreign variants into the 64 extra characters this
makes available for the purpose.
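
For concreteness, a decoder for that layout might look like the sketch
below (my reading of the proposal; the internal numbering of the decoded
values is an arbitrary choice for the sketch).

    #include <stdio.h>

    /* Decode one character using the proposed prefixes:
     *   0xxxxxxx            plain ASCII            (7 significant bits)
     *   10xxxxxx            extra Roman characters (6 significant bits)
     *   11xxxxxx xxxxxxxx   Kanji etc.            (14 significant bits)
     * Returns the character value, or -1 at end of file.                */
    long getch_prefixed(FILE *fp)
    {
        int c, c2;

        if ((c = getc(fp)) == EOF)
            return -1;
        if ((c & 0x80) == 0)                    /* 0xxxxxxx              */
            return (long)c;
        if ((c & 0x40) == 0)                    /* 10xxxxxx              */
            return (long)(0x80 + (c & 0x3f));
        if ((c2 = getc(fp)) == EOF)             /* 11xxxxxx xxxxxxxx     */
            return -1;
        return 0x100L + (((long)(c & 0x3f) << 8) | c2);
    }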

I know I said something like this before, but nobody seems to have noticed and
I am sufficiently egocentric to believe that there is something to it...
-- 
Name: Peter da Silva
Graphic: `-_-'
UUCP: ...!shell!{graffiti,baylor}!peter
IAEF: ...!kitty!baylor!peter

mark@cbosgd.UUCP (Mark Horton) (11/11/85)

The Japanese Kanji character set can be input in the same phonetic
way as was described for Chinese.  You type in 2 or 3 Roman letters
which phonetically sound like the syllable you want, and it turns into
the (unique) Katakana glyph for the syllable you want.  You do this
for every syllable in the word and then press a special key, and
something consults a (big) table and finds all the glyphs that sound
like that.  It puts up a menu, which often has 2-6 choices, on an extra
line at the bottom of the terminal.  You pick one and it goes up on
the screen.

I'm told there are about 60000 Kanji characters, and a few tens of
thousands more Chinese characters (I can't remember the exact numbers.)
However, a subset that fits in 14 bits is in common use, and they are
willing to restrict themselves to this subset.

There are apparently already official standards for encoding Kanji
in 16 bits, intermixed with ASCII.  It seems that you take the 14
bits and put them in two bytes, each byte with the 8th bit on.
Having two consecutive bytes with the parity bit on means it's a
Kanji character.  A single parity character might have a different
international meaning.
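
In code, the detection rule is as simple as the sketch below: two
consecutive bytes with the 8th bit set form one Kanji character, and a
byte with the bit clear is plain ASCII (what a lone high-bit byte means
is left open, as noted above).

    #include <stdio.h>

    /* Walk a buffer coded in the convention described above.           */
    void scan(unsigned char *s, int len)
    {
        int i = 0, kanji;

        while (i < len) {
            if ((s[i] & 0x80) && i + 1 < len && (s[i + 1] & 0x80)) {
                kanji = ((s[i] & 0x7f) << 7) | (s[i + 1] & 0x7f);
                printf("kanji %04x\n", kanji);   /* 14-bit code          */
                i += 2;
            } else {
                printf("ascii %c\n", s[i]);
                i += 1;
            }
        }
    }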

This doesn't break tail or grep.  I don't know what they do if there
are two European characters in a row, but I gather there is some
standard way of dealing with this.  The only mode needed is attached
to the keyboard, so it can tell if you're typing in Roman or Katakana.

By the way, I've seen several references to a function "printw" with
an assumption that this would be a 16 bit printf.  I'd like to point
out that the name "printw" has already been taken by curses, which is
present in both 4BSD and System V.  (printw means "print window.")
I'm not even convinced that such a function is needed, since the
existing standards seem oriented toward streams of 8 bit bytes.
I don't think stdio cares whether a character is Kanji or Roman,
that's between the application and the terminal.  Regular old printf
works fine.

	Mark Horton

P.S. Everybody agrees that this group should exist and should be
distributed worldwide, but the name "net.internat" is terrible.  Let's
settle the issue of whether it's to be moderated (I understand we have
a volunteer to be the moderator) and then call it either net.international
or mod.international.

req@warwick.UUCP (Russell Quin) (11/11/85)

[...]
>What we need is a re-configurable keyboard.  Of usable size but with all 
>needed characters displayed, as they appear on the screen, on the tops
>of the keys.  [ suggests using LCDs ] [...]
>If a set contains more characters than available keys then one key changes
>between 'pages' of characters.

I have enough problems coping with modes in editors (a lot of software seems
to have at least two modes where keys typed are interpreted differently),
without having to worry about what mode the *keyboard* is in as well!
This sort of information must be duplicated on the screen if it is to be
useful at all.  In any event, I don't look at both screen and keyboard when
typing.  Usually just the screen, in fact, unless the terminal is unfamiliar to
me (like this one).  Another problem -- look at the buttons on your keyboard.
Are they clean?  Not only do fingers conceal the keytops, but dirt wouldn't
help either, as well as the difficulty of getting an adequate connection to the
tiny display as it moves up & down.

There seem to be several other issues involved.
1	people using different alphabets need different sets of characters
	available.  A French keyboard without a cedilla is as useful as a
	Finnish one with a cedilla but no umlaut.

2	portability -- it isn't helpful if a program uses the grave accent
	(eg. Bourne Shell) and this happens to print as a Pound Sterling symbol
	on your device.  So it would be good if the same characters always
	printed in the same way.

3	Big alphabets -- there are already too many characters to fit onto a
	sane keyboard, but a big problem comes when there are *many*
	characters.  One possible solution here that has already been mentioned
	involves using multiple-character names for symbols and having a
	routine to turn this into/out of an internal representation.  The
	characters would be stored in a homogeneous way, so grep-like tools
	would work.  This would help for maths symbols, too.
	Which leads up to

4	Mixed alphabets: what does
	grep '[a-deltaC-OMEGA]' file
	mean?  What about
	grep '[alpha-epsilon ALPHA-EPSILON aleph yod Man-In-House-With-Dog]'?
	It seems sensible not to define the meaning of ranges of
	mixed alphabets (eg. [aleph-delta]), so a character's alphabet would
	have to be obvious from the internal representation.

By the time we get this far, we seem to be moving away from a good-old-ASCII-
computer-system and towards a cross between a graphics machine and a typesetter!
Since presumably not all machines would ever have access to all alphabets,
there are huge portability problems.
Has anyone built a machine that goes even partway towards addressing these
areas?  TeX or Troff in the tty driver... [0.5 :-)]
Perhaps we would do better to try not to address the huge oriental alphabets in
this way at all -- the benefits don't seem worthwhile.

>The internal representation  of character sets is a problem to be resolved at
>another time (stay tuned).
>			Craig.

I feel that the representation is important.  A standard will not be useful if
it can't be implemented.

		- Russell
-- 
		... mcvax!ukc!warwick!req  (req@warwick.UUCP)
		... mcvax!ukc!warwick!frplist (frplist@warwick.UUCP)
friend: someone one seems to be able to tolerate at the moment

spw2562@ritcv.UUCP (11/13/85)

In article <422@graffiti.UUCP> peter@graffiti.UUCP (Peter da Silva) writes:
>	0xxxxxxx	  Normal ASCII
>	10xxxxxx	  Foreign ROMAN characters
>	11xxxxxx xxxxxxxx Kanji or other extended character
>...and just stuff all the foreign variants into the 64 extra characters this
>makes available for the purpose.
>I am sufficiently egocentric to believe that there is something to it...
>Name: Peter da Silva
>UUCP: ...!shell!{graffiti,baylor}!peter
>IAEF: ...!kitty!baylor!peter

This is a good idea - even if you are egocentric 8-).
Alternately, use a 16 bit code, as has been mentioned, keeping the lower byte
the same as the current ascii standard, but setting bits in the upper byte
to indicate different variants of the base character.  This would
allow stripping the upper byte off to leave only 8 bits without changing which
basic character it is. As for totally unique characters, they could be
arbitrarily assigned.  If this has been suggested before, my apologies for
mentioning it again.  I just now started reading this newsgroup.
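
In C the stripping operation is just a mask, as in the sketch below (the
particular variant bits are invented; only the idea of "low byte = base
ASCII character, high byte = modifier flags" is being illustrated).

    /* 16-bit character: low byte is the plain ASCII base character,
     * high byte holds modifier flags.  Bit assignments are invented.   */
    #define V_UMLAUT  0x0100
    #define V_ACUTE   0x0200
    #define V_GRAVE   0x0400

    #define BASECHAR(c)  ((c) & 0x00ff)  /* strip variants -> plain ASCII */
    #define VARIANTS(c)  ((c) & 0xff00)

    /* e.g. o-umlaut is ('o' | V_UMLAUT); BASECHAR() of it is 'o', so a
     * sort or grep that only cares about the base letter can just mask. */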

==============================================================================
	Steve Wall @ Rochester Institute of Technology
	USnail: 6675 Crosby Rd, Lockport, NY 14094
	Usenet:	...!ritcv!spw2562			Unix 4.2 BSD
	BITNET:	SPW2562@RITVAXC				VAX/VMS 4.2
	Voice:  Yell "Hey Steve!"

franka@mmintl.UUCP (Frank Adams) (11/15/85)

In article <2004@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>	Just to play devil's advocate for a minute, let's say you have a
>file in greek, with the first couple of bytes being the "locking shift to
>greek" function.  Guess what breaks:
>
>	Tail -- you can't get the last 10 lines of a file if you don't read
>the whole file and track the shift commands.

Yeah, you can't.  You can read from the end back to the last (permanent) shift
command; anything preceding it is OK.  Of course, this frequently means
reading the whole file backwards.

>	Grep -- you're looking for all lines containing pi-iota-gamma;
>should grep track the shift commands and surround each output line by
>"locking shift to greek" and "back to English"?  If you do it that way, and
>run "grep ^ < greek1 > greek2", the greek[12] files will not cmp the same
>because the second will have lots of extraneous shift commands.  Do you now
>need a shift-optimizing filter to put files into canonical form?

Actually, I would use the sixteen bit format internally.  You use a standard
routine to read the file and convert to sixteen bit form, and another
standard routine to write the file, optimizing the shifts.  This takes
care of this whole class of problems.
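
The write-side "shift optimization" then amounts to remembering the
current alphabet and emitting a shift sequence only when it changes,
roughly as below (the 0xff escape and the alphabet codes are invented
placeholders, as in my earlier posting).

    #include <stdio.h>

    #define ALPHA(c)  ((c) >> 8)     /* invented: high byte = alphabet id */
    #define VALUE(c)  ((c) & 0xff)   /* low byte = character within it    */

    /* Write 16-bit internal characters as a shift-coded 8-bit stream,
     * emitting a locking shift only when the alphabet really changes.   */
    void putshifted(unsigned short *buf, int len, FILE *fp)
    {
        int i, cur = 0;              /* alphabet 0 = the default (Roman)  */

        for (i = 0; i < len; i++) {
            if (ALPHA(buf[i]) != cur) {
                cur = ALPHA(buf[i]);
                putc(0xff, fp);      /* escape byte                       */
                putc(cur, fp);       /* shift to the new alphabet         */
            }
            putc(VALUE(buf[i]), fp);
        }
    }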

>	I'm sure there are more examples, but you get the idea.

I never said it would be easy.  Just easier and more practical than throwing
away ASCII entirely, or having each non-ASCII character preceded by an
escape sequence.

While critical comments such as this are welcome, alternative suggestions
would be more so.

Frank Adams                           ihpn4!philabs!pwa-b!mmintl!franka
Multimate International    52 Oakland Ave North    E. Hartford, CT 06108

peter@graffiti.UUCP (Peter da Silva) (11/18/85)

> In article <422@graffiti.UUCP> peter@graffiti.UUCP (Peter da Silva) writes:
> >	0xxxxxxx	  Normal ASCII
> >	10xxxxxx	  Foreign ROMAN characters
> >	11xxxxxx xxxxxxxx Kanji or other extended character
> 
> Alternately, use a 16 bit code, as has been mentioned, keeping the lower byte
> the same as the current ascii standard, but setting bits in the upper byte

That would also work, but wouldn't address the problem of storage, which is
what I was attempting to solve. It's already been mentioned that most people
aren't willing to put up with a factor of two increase in the size of their
text files just to satisfy the Japanese. The reason for the ethnocentric
use of ASCII as the base character set is that most of the world's computers
are in the US...
-- 
Name: Peter da Silva
Graphic: `-_-'
UUCP: ...!shell!{graffiti,baylor}!peter
IAEF: ...!kitty!baylor!peter

henry@utzoo.UUCP (Henry Spencer) (11/19/85)

> ...I have enough problems coping with modes in editors ...
> without having to worry about what mode the *keyboard* is in as well!
> This sort of information must be duplicated on the screen if it is to be
> useful at all... Another problem -- look at the buttons on your keyboard.
> Are they clean?  Not only do fingers conceal the keytops, but dirt wouldn't
> help either, as well as the difficulty of getting an adequate connection to the
> tiny display as it moves up & down.

Actually, these problems can be solved by a sneaky trick.  You put an angled
glass plate over the keyboard, in your line of sight to it but high enough
that it does not obstruct hand access.  Then you put a monitor in the right
place so that the image of the monitor face seen in the glass is superimposed
on the keyboard.  Presto:  keytop displays that are dirt-proof and can be
seen *through* your fingers.  No tricky connection problems either.  It's
been tried, and it works pretty well.

It doesn't solve the problem of wanting to touch-type, though.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry