[comp.lang.c] isalpha in ctype.h

etxnisj@eos8c21.ericsson.se (Niklas Sjovall) (03/20/91)

Hi,

I want to use a macro defined in ctype.h on a Sun4 (4.03), but i don't
fully understand it.

The macro is:
#define	_U	01
#define	_L	02
extern	char	_ctype_[];
#define	isalpha(c)	((_ctype_+1)[c]&(_U|_L))

It's the part (_ctype_+1)[c] i don't understand. Could there be any
segmentation errors using this?

Thanks
 

gwyn@smoke.brl.mil (Doug Gwyn) (03/21/91)

In article <1991Mar20.112543.5515@ericsson.se> etxnisj@eos8c21.ericsson.se (Niklas Sjovall) writes:
-I want to use a macro defined in ctype.h on a Sun4 (4.03), but i don't
-fully understand it.
-The macro is:
-#define	_U	01
-#define	_L	02
-extern	char	_ctype_[];
-#define	isalpha(c)	((_ctype_+1)[c]&(_U|_L))
-It's the part (_ctype_+1)[c] i don't understand. Could there be any
-segmentation errors using this?

No, in fact that's a quite standard implementation of <ctype.h>.
The +1 is used to allow EOF (defined as -1) to also be used as the
argument to the is*() macros.
Note that some implementations will fail if fed arbitrary garbage
for arguments to the is*() macros, for example any integer value
more negative than -1.
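
As an illustration of staying inside the valid range, here is a minimal sketch using the standard <ctype.h> interface rather than Sun's internals. The helper name safe_isalpha is made up for this example. Plain char may be signed, so a character with the high bit set can show up as a negative value other than EOF; converting to unsigned char first avoids that:

```c
#include <ctype.h>

/* Hypothetical helper (not part of any library): classify a plain char
 * safely.  Converting to unsigned char keeps the argument in the range
 * 0..UCHAR_MAX, which the is*() macros are required to accept. */
int safe_isalpha(char ch)
{
    return isalpha((unsigned char)ch) != 0;
}
```

A value that comes straight from getchar() is already an int in the accepted range (a character as unsigned char, or EOF) and should be passed as is, without the cast.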

wirzenius@cc.helsinki.fi (Lars Wirzenius) (03/21/91)

In article <1991Mar20.112543.5515@ericsson.se>, etxnisj@eos8c21.ericsson.se (Niklas Sjovall) writes:
> #define	_U	01
> #define	_L	02
> extern	char	_ctype_[];
> #define	isalpha(c)	((_ctype_+1)[c]&(_U|_L))
> 
> It's the part (_ctype_+1)[c] i don't understand. Could there be any
> segmentation errors using this?

Since isalpha is part of the standard library (and a commonly used part
at that), there shouldn't be any errors if you use it correctly, i.e.
only give it valid
arguments. In this case, the arguments have to be valid characters or
the value of EOF (as defined in <stdio.h>).

The way this seems to be implemented by Sun is: _ctype_ is an array,
which is subscripted with the character argument (henceforth referred to
as c), and each element of the array is a collection of flags that
identify various characteristics of the character, such as whether it is
a letter or not. 

As long as you only need to test real characters, you can simply use
_ctype_[c].  However, isalpha should handle the value of EOF also.  We
could first test whether c == EOF, and use _ctype_ only if it isn't, but
that requires using c twice, which isn't good, because of possible side
effects (isalpha(getchar()) is quite reasonable sometimes). 

What we do instead is define EOF as -1 (we can do that, since we're
writing the whole library), and arrange so that EOF's flags come at the
beginning of the array (_ctype_[0]), then the real characters' flags,
each at an index one greater than the numeric value of the character.
This means that we can write _ctype_[c+1] to access the flags for
character c; EOF is -1 so its flags come at _ctype_[-1+1], i.e.
_ctype_[0]. 
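
The layout described above can be sketched with a toy table. This is only an illustration under stated assumptions, not Sun's actual source: the names my_ctype, my_ctype_init, MY_U, MY_L and MY_ISALPHA are invented here, EOF is assumed to be -1, and the letters are assumed to be contiguous as in ASCII.

```c
#include <limits.h>  /* UCHAR_MAX */

#define MY_U 01  /* upper-case flag, mirroring _U */
#define MY_L 02  /* lower-case flag, mirroring _L */

/* Toy table: slot 0 is for EOF (assumed to be -1), slot c+1 is for
 * character c, so there are UCHAR_MAX + 2 slots in total. */
static char my_ctype[UCHAR_MAX + 2];

/* Fill in the letter flags; assumes an ASCII-style character set in
 * which 'A'..'Z' and 'a'..'z' are contiguous. */
void my_ctype_init(void)
{
    int c;

    for (c = 'A'; c <= 'Z'; c++)
        my_ctype[c + 1] |= MY_U;
    for (c = 'a'; c <= 'z'; c++)
        my_ctype[c + 1] |= MY_L;
    /* my_ctype[0], the EOF slot, stays 0: EOF is not a letter. */
}

/* Evaluates c exactly once, so an argument with side effects is safe. */
#define MY_ISALPHA(c) (my_ctype[(c) + 1] & (MY_U | MY_L))
```

After calling my_ctype_init(), MY_ISALPHA('a') is nonzero, while MY_ISALPHA('1') and MY_ISALPHA(-1) are zero.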

Another way to write the expression is to use pointer arithmetic.  This
is what Sun has done.  The value of the name of an array, _ctype_,
becomes in value contexts a pointer to the first element of the array,
&_ctype_[0].  If we add 1 to this pointer, we get a pointer to the next
element, &_ctype_[1].  This pointer is then subscripted with the
character argument, since now the flags for character c are at offset c.
The flags for EOF are at index -1, which in this case is a valid index,
since it is still inside the real array, _ctype_.  However, subscripting
_ctype_ itself with -1 (i.e. _ctype_[-1]) is quite illegal, and can very
well result in a segmentation error; the same happens if you call
isalpha(-2).  Exactly what happens depends on the system; I believe
'undefined behaviour' is the phrase used in the ANSI standard for C
(there have been many nice suggestions for this behaviour, ranging from
mailing a complaint to Dennis Ritchie, to launching a nuclear attack;
segmentation errors and system crashes are more normal ones (I hope
:-)). 

-- 
Lars Wirzenius    wirzenius@cc.helsinki.fi

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (03/21/91)

In article <1991Mar20.112543.5515@ericsson.se>, etxnisj@eos8c21.ericsson.se (Niklas Sjovall) writes:
> I want to use a macro defined in ctype.h on a Sun4 (4.03), but i don't
> fully understand it.

You should read the manual page.  That tells you everything you need to
know in order to USE the macro.  In UNIX, it used to be the case that
the <ctype.h> macros were defined for EOF (-1) and for the integers
which satisfy isascii().  In ANSI C, the macros are defined for EOF
and for any value representable as unsigned char.  Think -1..255.

> It's the part (_ctype_+1)[c] i don't understand. Could there be any
> segmentation errors using this?

(_ctype_+1)[c] is identical to *((_ctype_+1)+(c)) which is
identical to _ctype_[(c)+1].  The +1 is there to map the lowest
legal value EOF (-1) to 0 (the lowest element of the array).
If you had full sources you'd probably find char _ctype_[257];
somewhere.
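
The identity can be checked with a small sketch. The table demo_table and the function same_element are stand-ins invented for this example (the real _ctype_ is private to the library), and the 257-entry size is only the guess made above.

```c
/* Stand-in table; slot 0 plays the role of the EOF entry. */
static const char demo_table[257] = { 0, 10, 20, 30 };

/* Returns 1 if all three spellings designate the same element for
 * index c.  Only valid for -1 <= c <= 255, like the real macro. */
int same_element(int c)
{
    const char *p1 = &(demo_table + 1)[c];   /* the macro's spelling  */
    const char *p2 = (demo_table + 1) + c;   /* pointer arithmetic    */
    const char *p3 = &demo_table[c + 1];     /* plain array subscript */

    return p1 == p2 && p2 == p3;
}
```

For every c from -1 through 255 the three pointers compare equal; outside that range the expressions themselves are out of bounds and the comparison means nothing.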

Yes, of course there can be segmentation errors using this,
if the value of c is outside the range -1 .. UCHAR_MAX, but
you have to keep your subscripts in range for _any_ C array.
-- 
Seen from an MVS perspective, UNIX and MS-DOS are hard to tell apart.

collinsa@p4.cs.man.ac.uk (Adrian Collins) (03/22/91)

In <1991Mar20.112543.5515@ericsson.se> etxnisj@eos8c21.ericsson.se (Niklas Sjovall) writes:

>Hi,

>I want to use a macro defined in ctype.h on a Sun4 (4.03), but i don't
>fully understand it.

>The macro is:
>#define	_U	01
>#define	_L	02
>extern	char	_ctype_[];
>#define	isalpha(c)	((_ctype_+1)[c]&(_U|_L))

>It's the part (_ctype_+1)[c] i don't understand. Could there be any
>segmentation errors using this?

From what I gather, _ctype_[] is an array (probably 256 bytes in length);
each character has a corresponding entry in the table which contains
information about the character's type, such as whether it is printable,
is whitespace, is uppercase, or is lowercase.

In the example above, it checks to see if the bits corresponding to
either uppercase or lowercase are set.  If either is set then the
character is an alphabetic character.

For some reason the first entry in the array isn't used for holding
character type information (beats me why), in which case the array is
probably 257 in length presuming it isn't null terminated.

Adrian

---
Adrian Collins                              collinsa@uk.ac.man.cs.p4
Department of Computer Science              a.m.collins@uk.ac.mcc
University of Manchester
Manchester,                                 "Let me face the peril"
UK.                                         "No, it's too perilous!"
                                                    - The Holy Grale

john@iastate.edu (Hascall John Paul) (03/25/91)

In article <collinsa.669647114@p4.cs.man.ac.uk> collinsa@p4.cs.man.ac.uk (Adrian Collins) writes:
}In <1991Mar20.112543.5515@ericsson.se> etxnisj@eos8c21.ericsson.se (Niklas Sjovall) writes:
}>I want to use a macro defined in ctype.h ... i don't fully understand it.
}>#define	isalpha(c)	((_ctype_+1)[c]&(_U|_L))
}>It's the part (_ctype_+1)[c] i don't understand.

}For some reason the first entry in the array isn't used for holding
}character type information (beats me why) ...

     The is????? macros are defined over the set (-1 ... 255) hence
the need to offset by 1 to `align' with C's "start at 0" arrays (-1
is for EOF).  This is so stuff like the following works correctly.

          do {
             c = getchar();
                 :
             if (isalpha(c)) fribbles(c);
                 :
          } while (c != EOF);
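
A complete version of that loop might look like the sketch below. The function count_alpha is invented for illustration and simply counts letters where the fragment above calls fribbles(); it reads from a FILE * with getc() so it can be tried on any stream.

```c
#include <ctype.h>
#include <stdio.h>

/* Count the alphabetic characters on a stream.  c must be an int, not
 * a char, so that EOF stays distinct from every real character. */
long count_alpha(FILE *in)
{
    long n = 0;
    int c;

    do {
        c = getc(in);
        if (isalpha(c))     /* fine even on the pass where c == EOF */
            n++;
    } while (c != EOF);

    return n;
}
```

Calling count_alpha(stdin) gives the same shape as the getchar() loop: the test on the final pass sees EOF, and isalpha(EOF) is simply false.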
--
John Hascall                        An ill-chosen word is the fool's messenger.
Project Vincent
Iowa State University Computation Center                       john@iastate.edu
Ames, IA  50011                                                  (515) 294-9551

stan@Dixie.Com (Stan Brown) (03/26/91)

john@iastate.edu (Hascall John Paul) writes:

=>In article <collinsa.669647114@p4.cs.man.ac.uk> collinsa@p4.cs.man.ac.uk (Adrian Collins) writes:
=>}In <1991Mar20.112543.5515@ericsson.se> etxnisj@eos8c21.ericsson.se (Niklas Sjovall) writes:
=>}>I want to use a macro defined in ctype.h ... i don't fully understand it.
=>}>#define	isalpha(c)	((_ctype_+1)[c]&(_U|_L))
=>}>It's the part (_ctype_+1)[c] i don't understand.

=>}For some reason the first entry in the array isn't used for holding
=>}character type information (beats me why) ...

=>     The is????? macros are defined over the set (-1 ... 255) hence
=>the need to offset by 1 to `align' with C's "start at 0" arrays (-1
=>is for EOF).  This is so stuff like the following works correctly.

=>          do {
=>             c = getchar();
=>                 :
=>             if (isalpha(c)) fribbles(c);
=>                 :
=>          } while (c != EOF);

	There was an excellent discussion of this subject about two months
	ago in _C_USERS_ magazine.  It was part of a series that
	will eventually cover all the standard headers for an ANSI-compliant
	compiler.


-- 
Stan Brown	P. c. Design 	404-363-2303	Ataant Ga.
(emory|gatech|uunet) rsiatl!sdba!stan           	"vi forever"
"Operating Systems, Like Editors Are Religions" -- Armando Stettner

dds@doc.ic.ac.uk (Diomidis Spinellis) (03/26/91)

In article <collinsa.669647114@p4.cs.man.ac.uk> collinsa@p4.cs.man.ac.uk (Adrian Collins) writes:
>In <1991Mar20.112543.5515@ericsson.se> etxnisj@eos8c21.ericsson.se (Niklas Sjovall) writes:
>
[...]
>>#define	isalpha(c)	((_ctype_+1)[c]&(_U|_L))
> 
>>It's the part (_ctype_+1)[c] i don't understand. Could there be any
>>segmentation errors using this?
[...]
> For some reason the first entry in the array isn't used for holding
> character type information (beats me why), in which case the array is
> probably 257 in length presuming it isn't null terminated.

In this particular implementation _ctype_[0] holds the type information
for EOF, the special constant defined in stdio.h, which happens to have
the value -1.  Thus _ctype_[1] has the type information for character 0,
_ctype_[2] for character 1, and so on.

Diomidis
-- 
Diomidis Spinellis                  Internet:                 dds@doc.ic.ac.uk
Department of Computing             UUCP:                    ...!ukc!icdoc!dds
Imperial College, London SW7        #define O(b,f,u,s,c,a)b(){int o=f(); ...