[comp.unix.wizards] byte != 8 bits

daveb@geac.UUCP (Dave Brown) (07/23/87)

In article <463@unisoft.UUCP> greywolf@unisoft.UUCP (The Grey Wolf @ ext 165) writes:
>	What is the problem here?  I see nothing wrong with eight bits for a
>character.  Can you come up with anything better?  What's the matter?
>Are escape sequences for special characters too much for you to handle?
>Gimme a break.
>
>			Disgusted that this discussion is even *happening*,

Point 1: languages other than English have larger character sets than
	English.  Unless you wish to read transliterated text, the
	character set used should contain *both* the English and
	non-English glyphs.  Escape sequencing occupies much more
	space than a large character set, and is understandably unpopular
	in (for example) Japan.


Point 2: FLAME ON
	Your last two sentences are an insult to non-English speakers.
	FLAME OFF

  --David (Je ne parle pas Français, je suis un rosbif) Collier-Brown
-- 
 David (Collier-) Brown.              |  Computer Science
 Geac Computers International Inc.,   |  loses its memory
 350 Steelcase Road,Markham, Ontario, |  (if not its mind)
 CANADA, L3R 1B3 (416) 475-0525 x3279 |  every 6 months.

dhesi@bsu-cs.UUCP (Rahul Dhesi) (08/01/87)

In article <857@bsu-cs.UUCP> I wrote:

>A byte is therefore exactly 8 bits.  No more and no less.

Amidst all the name-calling that followed, the following objection to
my statement was faintly discernible:

     Not all character sets will fit in 8 bits.

This is true, but it does not affect my claim.  A byte *is* exactly
8 bits.

First, 8 bits suffices for *most* of the world's languages.

Second, even if 8 bits is insufficient to hold a given character set
(and this is true for only a few languages), this simply means that
tradition must give way, and "character" and "byte" will not be
synonymous.  (If ANSI is not prepared for this, it's in for a rude
shock, in my opinion.)

Consider computer communications.  The world's networks deal in 8-bit
units.  Political reality being what it is, it was considered unwise to
call these bytes.  They are called octets.  What does one do with a
machine/character set with 9-bit bytes?  Map them to 8-bit bytes and
lose some information, or split them with shifting/masking and transmit
them as 8-bit units anyway.  One then finds things rather awkward.  One
embraces the 8-bit byte as soon as possible.
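
The "split them with shifting/masking" option can be sketched in C as
follows.  This is only an illustration, assuming 9-bit values held in
the low bits of a uint16_t and sent most significant bit first; the
names pack9 and unpack9 are made up for this example and do not come
from any real protocol.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack n 9-bit values (low 9 bits of each element of in) into octets,
 * most significant bit first.  The last octet is zero-padded on the
 * right.  out must have room for (9*n + 7) / 8 octets.
 * Returns the number of octets written. */
size_t pack9(const uint16_t *in, size_t n, uint8_t *out)
{
    uint32_t acc = 0;   /* bit accumulator */
    int nbits = 0;      /* number of valid bits currently in acc */
    size_t o = 0;

    for (size_t i = 0; i < n; i++) {
        acc = (acc << 9) | (in[i] & 0x1FFu);
        nbits += 9;
        while (nbits >= 8) {
            out[o++] = (uint8_t)(acc >> (nbits - 8));
            nbits -= 8;
        }
        acc &= (1u << nbits) - 1u;      /* keep only the leftover bits */
    }
    if (nbits > 0)
        out[o++] = (uint8_t)(acc << (8 - nbits));
    return o;
}

/* Inverse of pack9: recover n_values 9-bit values from the octets in in.
 * Returns the number of octets consumed. */
size_t unpack9(const uint8_t *in, size_t n_values, uint16_t *out)
{
    uint32_t acc = 0;
    int nbits = 0;
    size_t i = 0;

    for (size_t v = 0; v < n_values; v++) {
        while (nbits < 9) {
            acc = (acc << 8) | in[i++];
            nbits += 8;
        }
        out[v] = (uint16_t)((acc >> (nbits - 9)) & 0x1FFu);
        nbits -= 9;
        acc &= (1u << nbits) - 1u;
    }
    return i;
}
```

Four 9-bit values occupy 36 bits, so they travel as five octets with
four bits of padding, which is the awkwardness referred to above.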

Consider the cost-benefit analysis manufacturers must do.  Those that
want bytes to be other than 8 bits must give up the convenience of
using a lot of off-the-shelf parts.  Custom hardware is expensive.

Consider simple elegance.  With a 9-bit byte, one is either stuck with
wasted bits in a 32-bit machine word, or one must use a 36-bit word and
end up with wasted bits within machine instructions and within data
structures and/or get a nonorthogonal machine architecture.  (Aside:
Why do we see useless machine instructions such as "jump never, label"
and "mov a,a"?  Because orthogonality simplifies machine design.)  The
same goes for any other byte size except 16 bits, in which case we
could just as well take a pair of 8-bit bytes and call them by a new
name.
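
The arithmetic behind the "wasted bits" remark can be checked with a
few lines of C.  The function names below are invented for this sketch;
it only counts how many whole bytes fit in a word and how many bits are
left over.

```c
/* Number of whole bytes of byte_bits bits that fit in one machine word. */
unsigned bytes_per_word(unsigned word_bits, unsigned byte_bits)
{
    return word_bits / byte_bits;
}

/* Bits left unused in the word after packing in as many whole bytes
 * as possible. */
unsigned wasted_bits(unsigned word_bits, unsigned byte_bits)
{
    return word_bits % byte_bits;
}
```

With 9-bit bytes, a 32-bit word holds three bytes and wastes five bits,
while a 36-bit word holds four bytes exactly.  With 8-bit bytes, a
32-bit word holds four bytes with nothing left over.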

Consider devices.  The 8-bit byte is a standard unit of information
transfer using tape drives.  And I have a hunch most disk drives/
controllers are designed with 512-bytes-per-sector formatting in mind,
which won't neatly fit with any arbitrary byte/word size.

Consider a lot of things, and the 8-bit byte stares you in the face.

And consider that in most cases, if 8 bits are not enough, neither are
9, or 10, or perhaps even 11.
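
To put numbers on this, here is a small C sketch that computes how many
bits a fixed-width code needs for a character set of a given size.  The
function name is made up, and the 7000-symbol set in the note after the
code is purely illustrative, not a count taken from any standard.

```c
/* Smallest number of bits that can give each of n_symbols symbols a
 * distinct fixed-width code.  Returns 0 for n_symbols of 0 or 1. */
unsigned bits_needed(unsigned long n_symbols)
{
    unsigned bits = 0;
    unsigned long max_code;

    if (n_symbols == 0)
        return 0;
    max_code = n_symbols - 1;   /* largest code value that must fit */
    while (max_code > 0) {
        max_code >>= 1;
        bits++;
    }
    return bits;
}
```

An 11-bit unit tops out at 2048 symbols, so a hypothetical set of 7000
symbols would need 13 bits; going from 8 bits to 9, 10 or 11 does not
help such a set.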

How, then, does one deal with a character set that won't fit in 8 bits?

Predictions:

o    Such characters will, in the future, occupy two bytes.
o    There will be an increasing trend towards using transliterations
     that will allow unusual character sets to be represented using
     the Roman alphabet.
o    Increasingly, computations will be done using English, even in
     countries where English is not a major language.
o    Special-purpose machines using esoteric sizes of data units will
     continue to exist but will not replace general-purpose computers,
     which will continue to be based on the 8-bit byte.
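
The first prediction, a character occupying two 8-bit bytes, might look
like this in C.  It is a sketch only: the high-byte-first order and the
names put_wide and get_wide are assumptions made for illustration, not
any existing encoding.

```c
#include <stdint.h>

/* Store a 16-bit character code as two octets, high octet first. */
void put_wide(uint16_t code, uint8_t out[2])
{
    out[0] = (uint8_t)(code >> 8);
    out[1] = (uint8_t)(code & 0xFFu);
}

/* Rebuild the 16-bit character code from two octets, high octet first. */
uint16_t get_wide(const uint8_t in[2])
{
    return (uint16_t)(((unsigned)in[0] << 8) | in[1]);
}
```

Two octets give 65536 distinct codes, and the pair can pass unchanged
through anything that already moves 8-bit units.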
-- 
Rahul Dhesi         UUCP:  {ihnp4,seismo}!{iuvax,pur-ee}!bsu-cs!dhesi

elman@sdamos.ling.ucsd.edu (Jeff Elman) (08/01/87)

One of the arguments people have advanced in favor of "bigger bytes"
is to accommodate a broader diversity of character sets. 8-bit byters
have been accused of being lingua-centric (or worse) by assuming that
8 bits suffice for all characters.  Kanji is usually mentioned as a
counter-example.

I'm a little confused about this argument.  While Kanji are often
called "characters", they're not characters in the sense most people
probably understand.  Kanji are ideograms, and Kanji characters (or
character pairs) correspond to what we think of as words.  Is the proposal
thus that bytes should be capable of transmitting entire words?  That
hardly seems reasonable.

Or have I missed something?

Jeff Elman
Linguistics/UCSD
elman@amos.ling.ucsd.edu

gwyn@brl-smoke.ARPA (Doug Gwyn) (08/02/87)

In article <3566@sdcsvax.UCSD.EDU> elman@amos.ling.ucsd.edu (Jeff Elman) writes:
>I'm a little confused about this argument.  While Kanji are often
>called "characters", they're not characters in the sense most people
>probably understand.  Kanji are ideograms, and  Kanji characters (or 
>character pairs) correspond to what we think of as words.  Is the proposal 
>thus that  bytes should be capable of transmitting entire words?  That 
>hardly seems reasonable.  

The confusion is introduced by trying to take "character" and "word"
too literally.  What is necessary computationally is support for
handling individual basic textual units, whatever they might be.
In English, that includes letters of the alphabet in both upper-
and lower-case as well as digits and punctuation and separator
symbols.  One could include additional formatting controls as well,
and for some specialized disciplines such as mathematics a batch of
funny-looking squiggly things are also needed.

Thus, the desired "character set" contains whatever is necessary
so that a sequence of selections from the set can represent the
language.  In any case, the point was that a BYTE is NOT in general
large enough to encode all requisite basic textual units.

guy%gorodish@Sun.COM (Guy Harris) (08/02/87)

> This is true, but it does not affect my claim.  A byte *is* exactly
> 8 bits.
> ...9-bit bytes...
> ... 9-bit byte ...

Gee, if a byte *is* (emphasis yours) exactly 8 bits, why are you
talking about 9-bit bytes?

A byte is not exactly 8 bits.  Proof by counterexample: the DEC 10s
and 20s.  Or, for an even better counterexample, namely one that runs
UNIX: the Sperroughs Burrivac 1100 series.

If you meant "a byte *should be* exactly 8 bits wherever possible",
that's an opinion that can be meaningfully discussed.  However, you
said "a byte *is* exactly 8 bits", which is not an opinion, but an
antifact (i.e., it is stated as if it were a fact not subject to
debate, but it is false).
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com

mark@ems.MN.ORG (Mark H. Colburn) (08/03/87)

In article <911@bsu-cs.UUCP> dhesi@bsu-cs.UUCP (Rahul Dhesi) writes:
>In article <857@bsu-cs.UUCP> I wrote:
>This is true, but it does not affect my claim.  A byte *is* exactly
>8 bits.


	ARGH!!!!!!!

	NO!  An 'octet' is exactly eight bits, a byte is whatever size
	corresponds to the machine on which you are working.  Saying that
	a byte is 8 bits is like saying a word is *EXACTLY* 16 bits!

	Do I hear any BOOs out there from the people working on 680X0,
	or Crays, or Amdahls or ...????

	ASCII characters are usually stored in 8 bits (the code itself
	defines only 7), true, but that does not correspond to byte size.
	The standard transfer data size for most telecommunications
	protocols is 8 bits, true, but that, again, does not determine
	byte size.  And, again, *MOST* machines these days use 8-bit bytes,
	but that does not mean that all bytes are 8 bits long.
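
In C the byte size is something a program asks the implementation
rather than assumes.  A minimal sketch, using the CHAR_BIT macro from
the standard <limits.h> header (which the C standard requires to be at
least 8); the two function names are invented for this example.

```c
#include <limits.h>
#include <stddef.h>

/* Number of bits in one byte (one char) on this implementation. */
unsigned bits_per_byte(void)
{
    return CHAR_BIT;
}

/* Number of bits in an object that sizeof reports as nbytes bytes. */
size_t bits_in(size_t nbytes)
{
    return nbytes * CHAR_BIT;
}
```

On most current machines bits_per_byte() returns 8, but code that needs
the width of a type in bits is safer writing bits_in(sizeof(long)) than
multiplying by a hard-coded 8.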

	Drop it, you're wrong!

-- 
Mark H. Colburn    DOMAIN: mark@ems.MN.ORG 
EMS/McGraw-Hill      UUCP: ihnp4!meccts!ems!mark      AT&T: (612) 829-8200

henry@utzoo.UUCP (Henry Spencer) (08/05/87)

> o    Increasingly, computations will be done using English, even in
>      countries where English is not a major language

The French and the Japanese, to name two, will dispute this.  And let us
not forget that "computations" nowadays are often done by people like
secretaries and accountants, who do not appreciate having to learn a foreign
language to do them.
-- 
Support sustained spaceflight: fight |  Henry Spencer @ U of Toronto Zoology
the soi-disant "Planetary Society"!  | {allegra,ihnp4,decvax,utai}!utzoo!henry