[comp.lang.c] towards a faster isdigit

eggert@twinsun.com (Paul Eggert) (05/08/91)

The traditional implementation of isdigit() in <ctype.h> is typically
something like this:

	#define isdigit(c) ((_ctype_+1)[c] & 4)

which requires indexing through a global array followed by a masking
operation.  Why not use the following implementation instead?

	#define isdigit(c) ((unsigned)((c)-'0') < 10)

This needs just a subtraction followed by a comparison.  It's faster on
all the systems I've tried it on, and is strictly conforming ANSI C.

steve@taumet.com (Stephen Clamage) (05/08/91)

eggert@twinsun.com (Paul Eggert) writes:

>The traditional implementation of isdigit() in <ctype.h> is typically
>something like this:

>	#define isdigit(c) ((_ctype_+1)[c] & 4)

>which requires indexing through a global array followed by a masking
>operation.  Why not use the following implementation instead?

>	#define isdigit(c) ((unsigned)((c)-'0') < 10)

>This needs just a subtraction followed by a comparison.  It's faster on
>all the systems I've tried it on, and is strictly conforming ANSI C.

Maybe you didn't try it on enough systems :-)

The macro you suggest requires a test and jump, and on some modern RISC
machines the penalty is very high.  For example, I tried a loop which
only tested for isdigit and incremented a global based on the test.  On
a Sun-4 (SPARC), the program using your macro took 44% longer than the one
using the usual ctype macro.  On a DECStation (MIPS), your macro took
46% longer.

This points up the principle that you shouldn't diddle your code looking
for small speedups based on characteristics of the compiler or target
machine.  Your attempts may backfire badly when the code is re-used.
(I'm not accusing Paul of diddling, I just like to make this point.)

Apart from that, the other ctype tests (except isascii) are always faster
using the lookup table, and it is simpler to use a uniform set of macros.
-- 

Steve Clamage, TauMetric Corp, steve@taumet.com

wicklund@intellistor.com (Tom Wicklund) (05/08/91)

In <1991May8.030515.7004@twinsun.com> eggert@twinsun.com (Paul Eggert) writes:

>The traditional implementation of isdigit() in <ctype.h> is typically
>something like this:

>	#define isdigit(c) ((_ctype_+1)[c] & 4)

>which requires indexing through a global array followed by a masking
>operation.  Why not use the following implementation instead?

>	#define isdigit(c) ((unsigned)((c)-'0') < 10)

>This needs just a subtraction followed by a comparison.  It's faster on
>all the systems I've tried it on, and is strictly conforming ANSI C.

Two portability problems --

1)  Numeric digits aren't adjacent character codes in all character
sets (though I can't think of an example where this isn't true off
hand).

2) The (unsigned) cast may not work (I think an implementation could
return abs(x) as the result of an (unsigned) cast but am not sure),
so (unsigned)(c-'0') would be TRUE for "c" in the range '0'-9 .. '0'+9.

Note the first "traditional" implementation isn't completely portable
either since it assumes that EOF is -1, not necessarily true.

Two possible performance problems:

1)  Your example is probably faster in an -if- statement, but slower
in a statement such as:

   x = isdigit(y);

2)  On machines with complex addressing modes (e.g. 80x86 and 68xxx)
it should be possible to compile the table driven isdigit into a
single instruction which doesn't use an extra register.  Your example
is a single instruction but requires a register to hold (c - '0'),
which could mean other code won't optimize as well.

I think the "traditional" implementation is done to be consistent with
macros such as isalpha, which is much more efficient as a table (and
letters aren't adjacent codes in EBCDIC or foreign language ISO/ASCII
based character sets).

So both approaches should work on most modern machines and compilers,
so there's nothing wrong with implementing either.

ark@alice.att.com (Andrew Koenig) (05/08/91)

In article <1991May8.030515.7004@twinsun.com> eggert@twinsun.com (Paul Eggert) writes:

> Why not use the following implementation instead?

> 	#define isdigit(c) ((unsigned)((c)-'0') < 10)

> This needs just a subtraction followed by a comparison.  It's faster on
> all the systems I've tried it on, and is strictly conforming ANSI C.

This is surely an implementation issue.

However, it does suggest an interesting puzzle: can you find a
character set that meets the ANSI C constraints (in particular,
the one that says that all digits are contiguous in the character
set) for which this macro will not work?  (Yes, I think there is one)
-- 
				--Andrew Koenig
				  ark@europa.att.com

exspes@gdr.bath.ac.uk (P E Smee) (05/08/91)

In article <1991May8.163547.21677@gdr.bath.ac.uk> P.Smee@bristol.ac.uk (Paul Smee) writes:
>In article <1991May8.030515.7004@twinsun.com> eggert@twinsun.com (Paul Eggert) writes:
>>The traditional implementation of isdigit() in <ctype.h> is typically
>>something like this:
>>
>>	#define isdigit(c) ((_ctype_+1)[c] & 4)
>>
>>which requires indexing through a global array followed by a masking
>>operation.  Why not use the following implementation instead?
>>
>>	#define isdigit(c) ((unsigned)((c)-'0') < 10)
>>
>>This needs just a subtraction followed by a comparison.  It's faster on
>>all the systems I've tried it on, and is strictly conforming ANSI C.
>
>And, it works just fine, if you believe that the chars /.-,+*()' are
>digits.  :-)

Silly me.  Ignore that, if it gets to your site before I manage to cancel
it.  My excuse is simply that I've been in too long today, sigh...

-- 
Paul Smee, Computing Service, University of Bristol, Bristol BS8 1UD, UK
 P.Smee@bristol.ac.uk - ..!uunet!ukc!bsmail!p.smee - Tel +44 272 303132

henry@zoo.toronto.edu (Henry Spencer) (05/09/91)

In article <716@taumet.com> steve@taumet.com (Stephen Clamage) writes:
>>	#define isdigit(c) ((unsigned)((c)-'0') < 10)
>
>The macro you suggest requires a test and jump, and on some modern RISC
>machines the penalty is very high...

No, it requires a test and convert-test-value-to-data operation, which
on some RISC machines does not require a jump.  (I'll avoid putting in
a plug for my favorite one... :-))  Besides, the main use for predicates
like this is in driving conditional jumps, so in fact the conversion
is rarely needed and there is no added jump.
-- 
And the bean-counter replied,           | Henry Spencer @ U of Toronto Zoology
"beans are more important".             |  henry@zoo.toronto.edu  utzoo!henry

worley@compass.com (Dale Worley) (05/09/91)

In article <1991May8.154453.4462@intellistor.com> wicklund@intellistor.com (Tom Wicklund) writes:
   2) The (unsigned) cast may not work (I think an implementation could
   return abs(x) as the result of an (unsigned) cast but am not sure),
   so (unsigned)(c-'0') would be TRUE for "c" in the range '0'-9 .. '0'+9.

No, it can't.  The conversion of a signed integer to an unsigned
integer has to be done by the traditional modular arithmetic.  See
section 3.2.1.2 of the ANSI specification.

Dale

Dale Worley		Compass, Inc.			worley@compass.com
--
Why are you RUNNING?  Cerebus just wants to KILL you a little...
-- "Cerebus"

eggert@twinsun.com (Paul Eggert) (05/09/91)

steve@taumet.com (Stephen Clamage) writes:

>The macro you suggest requires a test and jump, and on some modern RISC
>machines the penalty is very high.  For example, I tried a loop which
>only tested for isdigit and incremented a global based on the test.  On
>a Sun-4 (SPARC), the program using your macro took 44% longer than the one
>using the usual ctype macro.  On a DECStation (MIPS), your macro took
>46% longer.

That's odd.  I tried several systems and got speedups for the proposed
isdigit() every time, if anything even more for RISC than CISC.  E.g.:

    prop    trad    speedup
     2.5     4.5    1.8     Sparcstation 1 SunOS 4.1.1 cc -O
     3.6     5.4    1.5     DECstation 3100 Ultrix V4.0 (Rev. 179) cc -O
     9.7    13.3    1.4     Sun-3/260 SunOS 4.1 cc -O

prop = time with proposed ``#define isdigit(c) ((unsigned)((c)-'0') < 10))''
trad = time with traditional isdigit() in <ctype.h>
speedup = trad/prop = how much faster the proposed isdigit() is

All times are the sum of the user+system CPU time in seconds,
and are measured with the command `time ./a.out 10000000' under csh.
Here's the program I used.

	#include <stdio.h>

	#if traditional
	#   include <ctype.h>
	#else
	#   define isdigit(c) ((unsigned)((c) - '0')  <  10)
	#endif

	main(argc, argv)
		int argc;
		char **argv;
	{
		char *p = argv[1];
		int i = atoi(p), c = *p;
		while (0 <= --i)
			if (isdigit(c))
				c = ++*p;
		printf("%s\n", p);
		return 0;
	}

steve@taumet.com (Stephen Clamage) (05/09/91)

I wrote:

|eggert@twinsun.com (Paul Eggert) writes:

|>Why not use the following implementation instead?

|>	#define isdigit(c) ((unsigned)((c)-'0') < 10)

|>This needs just a subtraction followed by a comparison.  It's faster on
|>all the systems I've tried it on, and is strictly conforming ANSI C.

|Maybe you didn't try it on enough systems :-)

|The macro you suggest requires a test and jump, and on some modern RISC
|machines the penalty is very high.

Bruce Evans pointed out to me that since isdigit() is used in tests,
the test-and-jump is already paid for.  I checked my test code to 
see why I got the results I did, and found they were the opposite
of what I reported.  I'm sorry for wasting everyone's time with this.

Paul's macro is indeed faster.
-- 

Steve Clamage, TauMetric Corp, steve@taumet.com

eggert@twinsun.com (Paul Eggert) (05/09/91)

ark@alice.att.com (Andrew Koenig) writes:

>> 	#define isdigit(c) ((unsigned)((c)-'0') < 10)
>... can you find a character set that meets the ANSI C constraints (in
>particular, the one that says that all digits are contiguous in the
>character set) for which this macro will not work?

Yes: any host where char is the same as int, and where '0' != 0.
The subtraction might overflow.

The following slightly revised isdigit() works around this problem
because unsigned arithmetic cannot overflow:

	#define isdigit(c) ((unsigned)(c) - '0'  <  10)

but frankly I worry that some old buggy compilers will generate the
wrong code for this ``more portable'' version.

dik@cwi.nl (Dik T. Winter) (05/09/91)

In article <20325@alice.att.com> ark@alice.UUCP () writes:
 > In article <1991May8.030515.7004@twinsun.com> eggert@twinsun.com (Paul Eggert) writes:
 > > 	#define isdigit(c) ((unsigned)((c)-'0') < 10)
 > However, it does suggest an interesting puzzle: can you find a
 > character set that meets the ANSI C constraints (in particular,
 > the one that says that all digits are contiguous in the character
 > set) for which this macro will not work?  (Yes, I think there is one)
That is easy.  If your character set uses '5'-s complement to represent
the digits it will fail.  But the one I know that does that uses only 6 bits
for a char, so that violates another ANSI C constraint.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

andy@research.canon.oz.au (Andy Newman) (05/09/91)

eggert@twinsun.com (Paul Eggert) writes:
> Why not use the following implementation instead?
>
>	#define isdigit(c) ((unsigned)((c)-'0') < 10)

Only works for character sets that have the decimal digits
arranged contiguously and with digit '0' having a lower
numeric value than any other digit character. This works
for ASCII and EBCDIC but what about other character sets?
-- 
Andy Newman (andy@research.canon.oz.au) Canon Info. Systems Research Australia
	"If education were illegal would more people would try it?"

steve@taumet.com (Stephen Clamage) (05/09/91)

wicklund@intellistor.com (Tom Wicklund) writes:

>In <1991May8.030515.7004@twinsun.com> eggert@twinsun.com (Paul Eggert) writes:
>>Why not use the following implementation instead?
>>	#define isdigit(c) ((unsigned)((c)-'0') < 10)

>Two portability problems --

>1)  Numeric digits aren't adjacent character codes in all character
>sets ...

ANSI C requires that '0' thru '9' be contiguous and in that order.

>2) The (unsigned) cast may not work 

ANSI C requires that the unsigned cast work as well-defined modulo
arithmetic, so that any negative value cast to unsigned looks like a
large positive value.

>Note the first "traditional" implementation isn't completely portable
>either since it assumes that EOF is -1, not necessarily true.

EOF and the <ctype.h> functions are supplied as part of the C library.
They must all be consistent.  Any library for which EOF is not -1 would
not use _ctype[(c)+1] to index an array.  We are talking about the
library implementation, not arbitrary user code.
-- 

Steve Clamage, TauMetric Corp, steve@taumet.com

sven@cs.widener.edu (Sven Heinicke) (05/09/91)

In <1991May9.025552.2292@research.canon.oz.au>, andy@research.canon.oz.au writes:
>eggert@twinsun.com (Paul Eggert) writes:
>> Why not use the following implementation instead?
>>
>>	#define isdigit(c) ((unsigned)((c)-'0') < 10)
>
>Only works for character sets that have the decimal digits
>arranged contiguously and with digit '0' having a lower
>numeric value than any other digit character. This works
>for ASCII and EBCDIC but what about other character sets?

I am not saying that is is a bad argument because we don't know what
is to come in the future but are there anyother kinds of systems that
are used besides ASCII and EBCDIC that show promise in the future (I
hope there are, because ASCII could use some work)

-- 
sven@cs.widener.edu                                  Widener CS system manager
Sven Mike Heinicke                                          and Student
(pssmheinicke@cyber.widener.edu (if you must))

rcd@ico.isc.com (Dick Dunn) (05/10/91)

wicklund@intellistor.com (Tom Wicklund) writes:
> 1)  Numeric digits aren't adjacent character codes in all character
> sets (though I can't think of an example where this isn't true off
> hand).

This seems like a weak statement!  If there exists such a character set, it
ought to be possible to discover it, shouldn't it?  (Translation: reality
check!)
-- 
Dick Dunn     rcd@ico.isc.com -or- ico!rcd       Boulder, CO   (303)449-2870
   ...If you plant ice, you're gonna harvest wind.

gwyn@smoke.brl.mil (Doug Gwyn) (05/11/91)

In article <1991May8.154453.4462@intellistor.com> wicklund@intellistor.com (Tom Wicklund) writes:
>1)  Numeric digits aren't adjacent character codes in all character
>sets (though I can't think of an example where this isn't true off
>hand).

Oh, yes they are, at least in terms of the C source character set,
because that is a requirement the standard imposes on a conforming
implementation.  (I argues and voted against that, by the way.)

>2) The (unsigned) cast may not work (I think an implementation could
>return abs(x) as the result of an (unsigned) cast but am not sure),

Type conversions are carefully specified in the C standard; what you
describe is not allowed in a conforming implementation.

sparkman@hydra3c.cs.utk.edu (Donald Sparkman) (05/15/91)

As far as a faster isdigit if you can insure only ascii then 

#define isdigit(c) ((c^0x30)<10)

may be faster since its using bitwise Xor instead of cast and subtraction 

Don Sparkman