nishri@utcsstat.UUCP (Alex Nishri) (05/03/84)
Does anyone have any experience or comments about the dependence of programs written in C on the ASCII character representation? Could most programs written in C be run on a different character representation scheme? What about the Unix system itself?

(For a completely different scheme, consider EBCDIC. The numerics collate after the alphabetics, so 'a' < '1' in EBCDIC. Also, EBCDIC has holes in the alphabetic sequence: 'a' + 1 is equal to 'b', but 'i' + 1 is not equal to 'j'. In fact, 'i' + 8 equals 'j'.)

	Alex Nishri
	University of Toronto
	... utcsstat!nishri
rpw3@fortune.UUCP (05/06/84)
#R:utcsstat:-187300:fortune:26900056:000:2297 fortune!rpw3 May 6 03:13:00 1984

[ After this, let's move this to "net.lang.c", shall we? ]

Many, many programs I have seen depend on certain characteristics of ASCII, though I am sure it varies by program as to how much of the total sequence is wired in. This has GOT to be a major factor in the cost of porting UNIX to a non-ASCII machine. Most of what I have seen included at least the following hard dependencies:

1. The digits are contiguous (no gaps). Kernighan & Ritchie [pp. 20-21]:

	"This particular program relies heavily on the properties of the
	character representation of digits. For example, the test

		if (c >= '0' && c <= '9') ...

	determines whether the character in "c" is a digit. If it is, the
	numeric value of that digit is

		c - '0'

	This works only if '0', '1', etc., are positive and in increasing
	order, and if there is nothing but digits between '0' and '9'.
	Fortunately, this is true for all conventional character sets."

Note particularly the word "all" in that last sentence. Again [page 39], in the sample "atoi(s)", the same assumption is made.

2. The lower-case letters (as a class) are contiguous, as are the uppers. Some programs know that 'A' + 040 == 'a'; some don't. Some only depend on 'a' > 'A' (so that 'x' - 'X' is a positive number). Interestingly, most of the programs I have seen DON'T assume any fixed distance between '9' and 'A'; when converting hexadecimal input they adjust for letters by subtracting 'A' - ('9' + 1) from the value of the letter.

3. The ASCII control characters exist, and have values of 'X' - 0100 for any control character <^X> (where 'X' is the upper-case letter of similar appearance). It is known (for example) that newline == '\n' == 'J' - 0100, and that 'H' - 0100 is a backspace.

In sum, many programs assume ASCII, or at least certain properties of the collating sequence.
The ones mentioned above are certainly not a complete list of what you may find when trying to use another character set, but they are a few "biggies". The use of "_ctype[]" can help, but many programs do not use it consistently. Sorry 'bout that...

Rob Warnock

UUCP:	{ihnp4,ucbvax!amd70,hpda,harpo,sri-unix,allegra}!fortune!rpw3
DDD:	(415) 595-8444
USPS:	Fortune Systems Corp, 101 Twin Dolphin Drive, Redwood City, CA 94065
barmar@mit-eddie.UUCP (Barry Margolin) (05/07/84)
Well, I don't know how dependent Unix really is on ASCII, but the folks at Amdahl probably thought it was dependent enough that they use ASCII, even though the underlying IBM-370 architecture is geared towards EBCDIC.
--
	Barry Margolin
	ARPA: barmar@MIT-Multics
	UUCP: ..!genrad!mit-eddie!barmar
jlw@ariel.UUCP (05/08/84)
>Well, I don't know how dependent Unix really is on ASCII, but the folks
>at Amdahl probably thought it was dependent enough that they use ASCII,
>even though the underlying IBM-370 architecture is geared towards
>EBCDIC.
>	Barry Margolin

This response should really be in net.arch, but I don't agree that the IBM machine is really geared toward EBCDIC. What is EBCDIC are the line printers, the card punches and readers, and the terminals, or, in other words, the peripherals. If you could get your C compiler typed in in ASCII and compiled, then the IBM compare instruction works just fine on ASCII. It's all a big conspiracy and a long chain of source text bootstrapping EBCDIC along.

	Joseph L. Wood, III
	AT&T Information Systems
	Laboratories, Holmdel
	(201) 834-3759
	ariel!jlw
alanr@drutx.UUCP (RobertsonAL) (05/09/84)
My recollection is that the only 370 EBCDIC dependencies in the HARDWARE are some instructions for converting strings to packed decimal internal representation. Some floating point conversion algorithms may use these instructions to good effect...
--
	Alan Robertson
	ihnp4!drutx!alanr
	AT&T Information Systems
	11900 N Pecos
	Denver, Colorado 80020
rjh@ihuxj.UUCP (Randolph J. Herber) (05/09/84)
>>Well, I don't know how dependent Unix really is on ASCII, but the folks
>>at Amdahl probably thought it was dependent enough that they use ASCII,
>>even though the underlying IBM-370 architecture is geared towards
>>EBCDIC.
>>	Barry Margolin

>I don't agree that the IBM machine is really geared toward EBCDIC.
>What is EBCDIC are the line printers, the card punches and readers,
>and terminals, or, in other words, the peripherals.
>	Joseph L. Wood, III

>My recollection of the only 370 EBCDIC dependencies in the HARDWARE
>are some instructions for converting strings to packed decimal internal
>representation.
>	Alan Robertson

IAW GA22-6821, IBM System/360 Principles of Operation:

	"The sign and zone codes generated for all decimal arithmetic
	results differ for the extended binary-coded-decimal interchange
	code (EBCDIC) and the American Standard code for information
	(ASCII). The choice between the two codes is determined by bit 12
	of the PSW. When bit 12 is zero the preferred EBCDIC codes are
	generated, which are plus, 1100; minus, 1101; and zone 1111. When
	bit 12 is one, the preferred ASCII codes are generated which are
	plus 1010; minus 1011; and zone 0101."

IAW GA22-7000-8, IBM System/370 Principles of Operation, under "COMPATIBILITY BETWEEN SYSTEM/360 AND SYSTEM/370" (System/370 is forward-compatible from System/360):

	"A program written for the System/360 will run on the System/370
	provided that it: ...
	2. Does not use PSW bit 12 as an ASCII bit ...."

BTW, PSW bit 12 is used in IBM System/370 to control the PSW format. When the bit is zero, the machine and PSW format are in Basic Control mode; when the bit is one, they are in Extended Control mode.

It is my understanding that the main reasons for using ASCII in Amdahl's UTS are efficiency and program reliability. Many programs are coded assuming the characteristics of the ASCII character set. Further details on this topic I do not have.

Randolph J. Herber, Amdahl Senior Systems Engineer, ..!ihnp4!ihuxj!rjh,
(312) 979-6554, or Cornet Indian Hill x6554, Amdahl Corp., Suite 250,
6400 Shafer, Rosemont, IL 60018, (312) 692-7520
barmar@mit-eddie.UUCP (Barry Margolin) (05/10/84)
--------------------
I don't agree that the IBM machine is really geared toward EBCDIC.
What is EBCDIC are the line printers, the card punches and readers,
and terminals, or, in other words, the peripherals.
--------------------

OK, so it is not the machine, but VM, under which UTS runs, certainly must depend on it. So, any text passed to system calls has to be translated to EBCDIC.
--
	Barry Margolin
	ARPA: barmar@MIT-Multics
	UUCP: ..!genrad!mit-eddie!barmar
rjh@ihuxj.UUCP (Randolph J. Herber) (05/10/84)
>From eli Wed May 9 22:26 EDT 1984 remote from houxq
>FROM: e.d.mantel
>TO: ihuxj!rjh
>DATE: 09 May 1984 22:26 EST
>SUBJECT: IBM 360 and ASCII mode
>
>I'm sure your addendum to the IBM == EBCDIC discussion was intended
>to be helpful, but I don't think you succeeded.
>
>I happen to know what an EC PSW is; most of the people reading the
>net have a chance at understanding that packed decimal is a data
>format, but are unlikely to know what EC PSWs have to do with ASCII
>support.
>
>The point of the prior articles in this discussion was the fact that
>the CPU contains very little character code dependency. I presume the
>point of your article was:
>	Once upon a time, IBM had an ASCII mode of operation. All
>	this did was allow packed decimal operations to be done in
>	ASCII (whatever it means to have ASCII packed decimal!). A
>	while ago (when the 370 line was introduced), IBM decided
>	that, as far as they were concerned, ASCII was not needed,
>	and dropped support for it.
>
>As for the reasons for UTS being ASCII-based, this has many
>advantages. It is stretching the definition of "reliability" when you
>really want to say things like portability and compatibility. As for
>efficiency, it's such a minute consideration compared to the
>portability and compatibility issues that there's no need to mention
>it.
>
>	Eli Mantel

I thank Eli for his clarification; and, basically, I agree with what he said. I have his permission to quote this private communication.

Randolph J. Herber, Amdahl Senior Systems Engineer, ..!ihnp4!ihuxj!rjh,
(work) (312) 979-6554 or AT&T Cornet 8-367-6554, (off.) Amdahl Corp.,
Suite 250, 6400 Shafer, Rosemont, IL 60018, (312) 692-7520
koved@umcp-cs.UUCP (05/10/84)
There appear to be only two instructions which are EBCDIC dependent: PACK and UNPACK. These are used to convert EBCDIC digits to packed form, which is used for arithmetic or can be further converted to binary.

Conversion from EBCDIC to ASCII and back is trivial: use a translate table (256 bytes) for each direction, and a single instruction (TRanslate) to convert up to 256 bytes at a time.

BTW, ASCII terminals are frequently attached to IBM mainframes. Two methods are possible. One is direct attach (or dial-up), which typically supports the IBM 3101 (an ASCII terminal) or other types of TTY. The other method is to attach through an external box (typically the IBM Series/1) which does the conversion between character sets.

I hope this clears up some of the confusion.

P.S. The only devices which are character set dependent are the display units (3270-type devices) and printers. Some are EBCDIC, and some are not. Codes may be device dependent.
--
Spoken: Larry Koved
Arpa:   koved.umcp-cs@CSNet-relay
Uucp:   ...{allegra,seismo}!umcp-cs!koved
andrew@orca.UUCP (Andrew Klossner) (05/10/84)
	"I don't agree that the IBM machine is really geared toward
	EBCDIC."

A program running on a 370-architecture machine can convert a binary number to an EBCDIC string in two instructions. There is no corresponding conversion to ASCII, though there was an "ASCII mode" bit in the Program Status Word on the 360 line, dropped in the 370 line.

-- Andrew Klossner   (decvax!tektronix!orca!andrew)      [UUCP]
                     (orca!andrew.tektronix@rand-relay)  [ARPA]
gwyn@brl-vgr.ARPA (Doug Gwyn ) (05/12/84)
Traditionally C has used the host computer's "native" character set (how can a convention be "native"? you ask; yet it really is). However, many programs written in C implicitly assume that the character set is ASCII, although the language doesn't guarantee this. I seem to recall that the C Language Standards Committee addressed this question, but I don't remember whether they decided that ASCII is the "official" C character set.

For my own use, in those few cases where the character codes are important, I have the following lines in my standard header file:

	/* integer (or character) arguments and value: */
	/* THESE PARTICULAR DEFINITIONS ARE FOR ASCII HOSTS ONLY */
	#define tohostc( c )	(c)		/* map ASCII to host char set */
	#define tonumber( c )	((c) - '0')	/* convt digit char to number */
	#define todigit( n )	((n) + '0')	/* convt digit number to char */

The idea is to use toascii() to map the native input characters to internal ASCII form, although you then have to do the same to the C character constants against which the mapped input characters are to be compared (or else use numerical ASCII codes). Then on output one uses tohostc() to map the internal form back to native chars.

Obviously there is non-negligible run-time overhead if the host character set is not ASCII but something stupid like EBCDIC, but I am willing to live with this in order not to have to change my source code when I port it to a non-ASCII machine (just the standard header needs to be changed).
avak@inmet.UUCP (05/17/84)
#R:mit-eddie:-176900:inmet:9200005:000:342 inmet!avak May 15 16:04:00 1984

The ED (edit) and EDMK (edit and mark) 370 instructions select three EBCDIC non-graphic codes to control the instruction's operation, leaving the remaining EBCDIC codes as "message characters". Since the codes are 0x20, 0x21, and 0x22 (ASCII space, exclamation point, and double quote), these instructions are not useful when the character set is ASCII.
guy@rlgvax.UUCP (Guy Harris) (05/19/84)
> "I don't agree that the IBM machine is really geared toward
> EBCDIC."
>
> A program running on a 370-architecture machine can convert a binary
> number to an EBCDIC string in two instructions. There is no
> corresponding conversion to ASCII, though there was an "ASCII mode" bit
> in the Program Status Word on the 360 line, dropped in the 370 line.

A binary number can, however, be converted to ASCII in three instructions on a 370-architecture machine: "CVD"; "UNPACK", "EDIT", or "EDMK"; and "TR". This counts as a bit of a kludge in my book, though, because it requires an intermediate result (the EBCDIC string) which is thrown out after translation to ASCII; this wouldn't have been necessary with an ASCII mode.

Somebody brought up the point (in one of these discussions) that "EDIT" and "EDMK" use three codes in the edit pattern string for special purposes, and that those codes correspond to ASCII space <SP> and two other printable characters. Anybody out there with a System/*360* Principles of Operation (I think we have a 370 PrincOps in house, but it wouldn't help) know what "EDIT" and "EDMK" did about this in ASCII mode?

	Guy Harris
	{seismo,ihnp4,allegra}!rlgvax!guy
eager@amd70.UUCP (Mike Eager) (05/22/84)
About attaching ASCII terminals to IBM S/370 systems: most use an IBM 3704 or 3705. This is a terminal concentrator and front end. It does many wonderful and magical things. Given the channel structure of the 370, I doubt that anyone has a direct connect. The 3705, by the way, is an IBM 360/50 in disguise, unless I've been lied to.

Considering the LARGE number of ASCII constants embedded in C programs, I would be surprised if the UTS system used EBCDIC internally. I imagine that there would be code dependencies in many programs.
gam@proper.UUCP (Gordon Moffett) (05/22/84)
# Virtually ALL the application programs on UTS written in C assume that ASCII is the base character set. In fact, many of the programs you are familiar with on other architectures are just the same on UTS (but -- see below about type ``char'').

The ``virtually'' refers to two cases (that I know of) where EBCDIC is used: in device drivers for EBCDIC-based devices (like 3270's (IBM tubes)), and in programs that read/write volume labels on tapes or disks. The drivers do EBCDIC <--> ASCII translations, and the volume labels are artifacts of an Amdahl-compatible environment. The applications (and for the most part systems) programmer need never be aware of EBCDIC on UTS.

Oh, by the way, the type ``char'' is unsigned in UTS/370-architecture, so for all you people who've been writing:

	char c;
	while ((c = getc()) != EOF)
		...

... you have frustrated my work very much ....

UTS is a registered trademark of Amdahl Corporation.