[net.unix] Does C depend on ASCII?

nishri@utcsstat.UUCP (Alex Nishri) (05/03/84)

Does anyone have any experience or comments about the dependability of
programs written in C on the ASCII character representation?  Could most
programs written in C be run on a different character representation
scheme?  What about the Unix system itself?

(For a completely different scheme consider EBCDIC.  The numerics collate
after the alphabetics.  So 'a' < '1' in EBCDIC.  Also EBCDIC has holes in
the alphabetic sequence.  Thus 'a' + 1 is equal to 'b', but 'i' + 1 is not
equal to 'j'.  In fact 'i' + 8 equals 'j'.)

Alex Nishri
University of Toronto
 ... utcsstat!nishri

rpw3@fortune.UUCP (05/06/84)

#R:utcsstat:-187300:fortune:26900056:000:2297
fortune!rpw3    May  6 03:13:00 1984

[ After this, let's move this to "net.lang.c", shall we? ]

Many, many programs I have seen depend on certain characteristics of ASCII,
but I am sure it varies by program as to how much of the total sequence is
wired in. This has GOT to be a major factor in the cost of porting UNIX to
a non-ASCII machine. Most of what I have seen included at least the
following hard dependencies:

1. The numbers are contiguous (no gaps).

	Kernighan & Ritchie [pp20-21]:
   	"This particular program relies heavily on the properties of the
	character representation of digits. For example, the test

		if (c >= '0' && c <= '9') ...

   	determines whether the character in "c" is a digit. If it is, the
	numeric value of that digit is

		c - '0'

   	This works only if '0', '1', etc., are positive and in increasing
	order, and if there is nothing but digits between '0' and '9'.
	Fortunately, this is true for all conventional character sets."

   Note particularly the word "all" in that last sentence. Again [page 39],
   in the sample "atoi(s)", the same assumption is made.
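   A sketch of that atoi, in the K&R style (like the book's version, it
   assumes only that '0' through '9' are consecutive codes, which holds in
   every common character set):

```c
/* Minimal atoi in the K&R style: relies on '0'..'9' being contiguous
   and increasing, so that c - '0' yields the digit's numeric value. */
static int my_atoi(const char *s)
{
    int n = 0;
    while (*s >= '0' && *s <= '9')
        n = 10 * n + (*s++ - '0');
    return n;
}
```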

2. The lowercase letters (as a class) are contiguous, as are the uppers.
   Some programs know that 'A' + 040 == 'a', some don't. Some only depend
   on 'a' > 'A' (so that 'x' - 'X' is a positive number).

Interestingly, most of the programs I have seen DON'T assume any fixed
distance between '9' and 'A', but when converting hexadecimal input they
adjust for letters by subtracting 'A' - ('9' + 1) from the value of the letter.
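A hex-conversion routine written that careful way looks like this (it
assumes only that '0'-'9' and 'A'-'F' are each contiguous runs, which is
true even on EBCDIC, where A through I are consecutive):

```c
/* Convert one hex digit character to its value, or -1 if not a hex digit.
   c - 'A' + 10 is the same as subtracting 'A' - ('9' + 1) from the letter
   and then subtracting '0'; no fixed '9'-to-'A' distance is assumed. */
static int hexval(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}
```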

3. The ASCII control characters exist, and have values of 'X' - 0100 for
   any control character <^X> (where 'X' is the upper-case letter of similar
   appearance). It is known (for example) that newline == '\n' == 'J' - 0100,
   and that 'H' - 0100 is a backspace.
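   The usual idiom for this (pure ASCII knowledge, since it depends on the
   control characters sitting exactly 0100 below the letters):

```c
/* The classic ASCII trick: a control character is its letter minus
   octal 0100, i.e. the same code with bit 6 cleared.  CTRL('J') is
   newline (012) and CTRL('H') is backspace (010) -- on ASCII only. */
#define CTRL(x) ((x) - 0100)
```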

In sum, many programs assume ASCII, or at least, certain properties of the
collating sequence. The ones mentioned above are certainly not a complete
list of what you may find when trying to use another character set, but they
are a few "biggies". The use of "_ctype[]" can help, but many programs do not
use it with consistency.

Sorry 'bout that...

Rob Warnock

UUCP:	{ihnp4,ucbvax!amd70,hpda,harpo,sri-unix,allegra}!fortune!rpw3
DDD:	(415)595-8444
USPS:	Fortune Systems Corp, 101 Twin Dolphin Drive, Redwood City, CA 94065

barmar@mit-eddie.UUCP (Barry Margolin) (05/07/84)

Well, I don't know how dependent Unix really is on ASCII, but the folks
at Amdahl probably thought it was dependent enough that they use ASCII,
even though the underlying IBM-370 architecture is geared towards
EBCDIC.
-- 
			Barry Margolin
			ARPA: barmar@MIT-Multics
			UUCP: ..!genrad!mit-eddie!barmar

jlw@ariel.UUCP (05/08/84)

>Well, I don't know how dependent Unix really is on ASCII, but the folks
>at Amdahl probably thought it was dependent enough that they use ASCII,
>even though the underlying IBM-370 architecture is geared towards
>EBCDIC.
>-- 
>			Barry Margolin
>			ARPA: barmar@MIT-Multics
>			UUCP: ..!genrad!mit-eddie!barmar

This response should really be in net.arch, but I don't
agree that the IBM machine is really geared toward EBCDIC.
What is EBCDIC are the line printers, the card punches
and readers, and terminals, or, in other words, the
peripherals.  If you could get your C compiler source entered in
ASCII and compiled, then the IBM compare instruction
works just fine on ASCII.  It's all a big conspiracy
and a long chain of source text bootstrapping EBCDIC
along.



					Joseph L. Wood, III
					AT&T Information Systems
					Laboratories, Holmdel
					(201) 834-3759
					ariel!jlw

alanr@drutx.UUCP (RobertsonAL) (05/09/84)

My recollection of the only 370 EBCDIC dependencies in the HARDWARE
are some instructions for converting strings to packed decimal internal
representation.  Some floating point conversion algorithms may use
these instructions to good effect...


	-- Alan Robertson
	   ihnp4!drutx!alanr
	   AT&T Information Systems
	   11900 N Pecos
	   Denver, Colorado 80020

rjh@ihuxj.UUCP (Randolph J. Herber) (05/09/84)

>>Well, I don't know how dependent Unix really is on ASCII, but the folks
>>at Amdahl probably thought it was dependent enough that they use ASCII,
>>even though the underlying IBM-370 architecture is geared towards
>>EBCDIC.
>>-- 
>>			Barry Margolin
>>			ARPA: barmar@MIT-Multics
>>			UUCP: ..!genrad!mit-eddie!barmar
>This response should really be in net.arch, but I don't
>agree that the IBM machine is really geared toward EBCDIC.
>What is EBCDIC are the line printers, the card punches
>and readers, and terminals, or, in other words, the
>peripherals.  If you could get your C compiler typed in in
>ASCII and compiled, then the IBM compare instruction
>works just fine on ASCII.  Its all a big conspiracy
>and a long chain of source text bootstrapping EBCDIC
>along.
>					Joseph L. Wood, III
>					AT&T Information Systems
>					Laboratories, Holmdel
>					(201) 834-3759
>					ariel!jlw
>My recollection of the only 370 EBCDIC dependencies in the HARDWARE
>are some instructions for converting strings to packed decimal internal
>representation.  Some floating point conversion algorithms may use
>these instructions to good effect...
>	-- Alan Robertson
>	   ihnp4!drutx!alanr
>	   AT&T Information Systems
>	   11900 N Pecos
>	   Denver, Colorado 80020

IAW GA22-6821 IBM System/360 Principles of Operation:
	"The sign and zone codes generated for all decimal arithmetic
	results differ for the extended binary-coded-decimal interchange
	code (EBCDIC) and the American Standard Code for Information
	Interchange (ASCII).  The choice between the two codes is
	determined by bit 12 of the PSW.  When bit 12 is zero, the
	preferred EBCDIC codes are generated, which are plus, 1100;
	minus, 1101; and zone, 1111.  When bit 12 is one, the preferred
	ASCII codes are generated, which are plus, 1010; minus, 1011;
	and zone, 0101."

IAW GA22-7000-8 IBM System/370 Principles of Operation:
	"COMPATIBILITY BETWEEN SYSTEM/360 AND SYSTEM/370:
	System/370 is forward-compatible from System/360.  A program
	written for the System/360 will run on the System/370 provided
	that it: ...
		2. Does not use PSW bit 12 as an ASCII bit ...."

BTW, PSW bit 12 is used in IBM System/370 to control the PSW format.
When the bit is zero, the machine and PSW format is in Basic Control mode.
When the bit is one, the machine and PSW format is in Extended Control mode.

It is my understanding that the main reasons for using ASCII in Amdahl's UTS
are efficiency and program reliability. Many programs are coded assuming
the characteristics of the ASCII character set.  I do not have any
further details on this topic.

	Randolph J. Herber, Amdahl Senior Systems Engineer,
	..!ihnp4!ihuxj!rjh, (312) 979-6554, or Cornet Indian Hill x6554,
	Amdahl Corp., Suite 250, 6400 Shafer, Rosemont, IL 60018,
	(312) 692-7520

barmar@mit-eddie.UUCP (Barry Margolin) (05/10/84)

--------------------
I don't
agree that the IBM machine is really geared toward EBCDIC.
What is EBCDIC are the line printers, the card punches
and readers, and terminals, or, in other words, the
peripherals.
--------------------

OK, so it is not the machine, but VM, under which UTS runs, certainly
must depend on it.  So, any text passed to system calls has to be
translated to EBCDIC.
-- 
			Barry Margolin
			ARPA: barmar@MIT-Multics
			UUCP: ..!genrad!mit-eddie!barmar

rjh@ihuxj.UUCP (Randolph J. Herber) (05/10/84)

>From eli Wed May  9 22:26 EDT 1984 remote from houxq
>FROM: e.d.mantel
>TO: ihuxj!rjh
>DATE: 09 May 1984  22:26 EST
>SUBJECT: IBM 360 and ASCII mode
>
>I'm sure your addendum to the IBM == EBCDIC discussion was intended
>to be helpful, but I don't think you succeeded.
>
>I happen to know what an EC PSW is; most of the people reading the
>net have a chance at understanding that packed decimal is a data format,
>but are unlikely to know what EC PSWs have to do with ASCII support.
>
>The point of the prior articles in this discussion was the fact that the
>CPU contains very little character code dependency.  I presume the point
>of your article was:
>	Once upon a time, IBM had an ASCII mode of operation.  All
>	this did was allow packed decimal operations to be done in
>	ASCII (whatever it means to have ASCII packed decimal!).  A
>	while ago (when the 370 line was introduced), IBM decided that,
>	as far as they were concerned, ASCII was not needed, and dropped
>	support for it.
>
>As for the reasons for UTS being ASCII-based, this has many advantages.  It
>is stretching the definition of "reliability" when you really want to say
>things like portability and compatibility.  As for efficiency, it's such
>a minute consideration compared to the portability and compatibility issues
>that there's no need to mention it.
>
>						Eli Mantel

I thank Eli for his clarification; and, basically, I agree with what
he said.  I have his permission to quote this private communication.

	Randolph J. Herber, Amdahl Senior Systems Engineer,
	..!ihnp4!ihuxj!rjh,
	(work) (312) 979-6554 or AT&T Cornet 8-367-6554,
	(off.) Amdahl Corp., Suite 250, 6400 Shafer, Rosemont, IL 60018,
	       (312) 692-7520

koved@umcp-cs.UUCP (05/10/84)

There appear to be only 2 instructions which are EBCDIC dependent -
PACK and UNPACK.  These are used to convert EBCDIC digits to packed
form which is used for arithmetic, or can be further converted to binary.
Conversion from EBCDIC to ASCII and back is trivial...use a translate
table (256 bytes) for each direction, and a single instruction (TRanslate)
to convert up to 256 bytes at a time.
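The same translate-table approach writes naturally in C; here is a sketch
(only a few illustrative EBCDIC-to-ASCII entries are filled in -- a real
table covers all 256 codes):

```c
#include <stddef.h>

/* One 256-byte table per direction, indexed by the source byte,
   exactly as the 370 TR instruction uses it. */
static unsigned char e2a[256];

static void init_table(void)
{
    /* illustrative entries only; a real table maps every code */
    e2a[0xC1] = 'A';   /* EBCDIC 'A'    */
    e2a[0xF0] = '0';   /* EBCDIC '0'    */
    e2a[0x40] = ' ';   /* EBCDIC space  */
}

/* Translate a buffer in place, one table lookup per byte. */
static void translate(unsigned char *buf, size_t n)
{
    while (n--) {
        *buf = e2a[*buf];
        buf++;
    }
}
```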

BTW, ASCII terminals are frequently attached to IBM Mainframes.
Two methods are possible.  One is direct attachment (or dial-up),
which typically supports the IBM 3101 (an ASCII terminal) or other
types of TTY.  The other method is to attach to an external box
(typically the IBM Series/1) which does the conversion between
character sets.

I hope this clears up some of the confusion.

P.S.  The only devices which are character set dependent are the
display units (3270 type devices), and printers.  Some are EBCDIC,
and some are not.  Codes may be device dependent.
-- 
Spoken: Larry Koved
Arpa:   koved.umcp-cs@CSNet-relay
Uucp:...{allegra,seismo}!umcp-cs!koved

andrew@orca.UUCP (Andrew Klossner) (05/10/84)

	"I don't agree that the IBM machine is really geared toward
	EBCDIC."

A program running on a 370-architecture machine can convert a binary
number to an EBCDIC string in two instructions.  There is no
corresponding conversion to ASCII, though there was an "ASCII mode" bit
in the Program Status Word on the 360 line, dropped in the 370 line.

  -- Andrew Klossner   (decvax!tektronix!orca!andrew)      [UUCP]
                       (orca!andrew.tektronix@rand-relay)  [ARPA]

gwyn@brl-vgr.ARPA (Doug Gwyn ) (05/12/84)

Traditionally C has used the host computer "native" character set
(how can a convention be "native"? you ask; yet it really is).
However many programs written in C implicitly assume that the
character set is ASCII, although the language doesn't guarantee this.

I seem to recall that the C Language Standards Committee addressed
this question but I don't remember whether they decided that ASCII is
the "official" C character set.

For my own use in those few cases where the character codes are
important, I have the following lines in my standard header file:
/* integer (or character) arguments and value: */
/* THESE PARTICULAR DEFINITIONS ARE FOR ASCII HOSTS ONLY */
#define tohostc( c )	(c)		/* map ASCII to host char set */
#define tonumber( c )	((c) - '0')	/* convt digit char to number */
#define todigit( n )	((n) + '0')	/* convt digit number to char */

The idea is to use toascii() to map the native input characters to
internal ASCII form, although you then have to do the same to the
C character constants against which the mapped input characters are
to be compared (or else use numerical ASCII codes).  Then on output
one uses tohostc() to map the internal form back to native chars.
Obviously there is non-negligible run-time overhead if the host
character set is not ASCII but something stupid like EBCDIC, but I
am willing to live with this in order to not have to change my source
code when I port it to a non-ASCII machine (just the standard header
needs to be changed).
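On an ASCII host the mappings collapse to identities, so a round trip
through the internal form is a no-op; a sketch of the intended usage
(roundtrip() is just an illustrative name, not part of the header):

```c
#include <ctype.h>	/* for toascii() on traditional Unix */

/* The ASCII-host definitions from the standard header above. */
#define tohostc( c )	(c)		/* map ASCII to host char set */
#define tonumber( c )	((c) - '0')	/* convt digit char to number */
#define todigit( n )	((n) + '0')	/* convt digit number to char */

/* Take a digit in host code, work on it internally in ASCII,
   and emit it again in host code. */
int roundtrip(int hostc)
{
    int internal = toascii(hostc);      /* host -> internal ASCII  */
    int value    = tonumber(internal);  /* ASCII digit -> number   */
    return tohostc(todigit(value));     /* number -> host digit    */
}
```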

avak@inmet.UUCP (05/17/84)

#R:mit-eddie:-176900:inmet:9200005:000:342
inmet!avak    May 15 16:04:00 1984

The ED (edit) and EDMK (edit and mark) 370 instructions select
three EBCDIC non-graphic codes to control the instruction operation,
leaving the remaining EBCDIC codes as "message characters".  Since the
codes are 0x20, 0x21, and 0x22 (ASCII space, exclamation, double quote),
these instructions are not useful when the character set is ASCII.

guy@rlgvax.UUCP (Guy Harris) (05/19/84)

> 	"I don't agree that the IBM machine is really geared toward
> 	EBCDIC."

> A program running on an 370-architecture machine can convert a binary
> number to an EBCDIC string in two instructions.  There is no
> corresponding conversion to ASCII, though there was an "ASCII mode" bit
> in the Program Status Word on the 360 line, dropped in the 370 line.

A binary number can, however, be converted to ASCII in three instructions on a
370-architecture machine; "CVD", "UNPACK" or "EDIT" or "EDMK", and "TR".  This
counts as a bit of a kludge in my book, though, because it requires an
intermediate result (the EBCDIC string) which is thrown out after translation
to ASCII; this wouldn't have been necessary with an ASCII mode.

Somebody brought up the point (in one of these discussions) that "EDIT" and
"EDMK" use three codes in the edit pattern string for special purposes, and
that those codes correspond to ASCII space <SP> and two other printable
characters.  Anybody out there with a System/*360* Principles of Operation
(I think we have a 370 PrincOps in house, but it wouldn't help) know what
"EDIT" and "EDMK" did about this in ASCII mode?

	Guy Harris
	{seismo,ihnp4,allegra}!rlgvax!guy

eager@amd70.UUCP (Mike Eager) (05/22/84)

About attaching ASCII terminals to IBM S/370 systems:  Most use an IBM 3704
or 3705.  This is a terminal concentrator & frontend.  It does many wonderful
and magical things.  Given the channel structure of the 370, I doubt that
anyone has a direct connect.  The 3705, by the way, is an IBM 360/50, in
disguise, unless I've been lied to.

Considering the LARGE number of ASCII constants embedded in C programs, I would
be surprised if the UTS system used EBCDIC internally.  I imagine that there 
would be code dependencies in many programs.

gam@proper.UUCP (Gordon Moffett) (05/22/84)

#
Virtually ALL the application programs on UTS written in C assume
that ASCII is the base character set.  In fact, many of the
programs you are familiar with on other architectures are just
the same on UTS.  (but -- see below about type ``char'').

The ``virtually'' refers to two cases (that I know of) where EBCDIC
is used: in device drivers for EBCDIC-based devices (like 3270's
(IBM tubes)), and programs that read/write volume labels on tapes
or disks.  The drivers do EBCDIC <--> ASCII translation, and the
volume labels are artifacts of an Amdahl-compatible environment.

The applications (and for the most part systems) programmer need
never be aware of EBCDIC on UTS.

Oh, by the way, the type ``char'' is unsigned in UTS/370-architecture,
so for all you people who've been writing:

	char c;
	while ((c = getchar()) != EOF) ...

... you have frustrated my work very much ....
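For the record, the portable form keeps the result in an int, so EOF (a
negative int) survives the comparison whether char is signed or unsigned;
a sketch (copybytes is just an illustrative name):

```c
#include <stdio.h>

/* Copy in to out byte by byte.  c must be int, not char: getc()
   returns EOF as a negative int, and with unsigned char -- as on
   UTS -- the comparison (c == EOF) could never succeed, so the
   char version above loops forever. */
long copybytes(FILE *in, FILE *out)
{
    int c;
    long n = 0;
    while ((c = getc(in)) != EOF) {
        putc(c, out);
        n++;
    }
    return n;
}
```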


UTS is a registered trademark of Amdahl Corporation.