[comp.lang.c] binary data files

scs@adam.pika.mit.edu (Steve Summit) (04/29/89)

From time to time we debate the relative merits of "binary" and
"ascii" data file formats.  Today I discovered Yet Another Good
Reason not to use binary files, something of which I was in general
already eminently convinced.  I'll pass the reason on, in case
it will keep anyone else from making the same mistake.

I am working with some code which, for historical reasons, writes
C structures out to files and reads them back in, in the usual
(unrecommended) byte-for-byte binary way, using fwrite and fread.
It happens that the structures contain pointers, the writing of
which to files is an even worse idea, because the pointers are
virtually certain to be meaningless when read back in.  The
struct-writing code is therefore careful to chase down the
pointers and additionally write the substructures and strings
pointed to; the struct-reading code ignores the garbage pointers
read in and replaces them with new pointers to the memory
containing the pointed-to information just read from the file.
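
In outline, the scheme looks something like this (a sketch with
hypothetical names and no error handling, assuming <stdio.h>,
<string.h>, and <stdlib.h>; the real code is hairier):

	struct rec {
		int val;
		char *name;	/* meaningless once written to disk */
	};

	void write_rec(FILE *fp, struct rec *r)
	{
		int len = strlen(r->name) + 1;

		fwrite((char *)r, sizeof(struct rec), 1, fp);
		fwrite((char *)&len, sizeof(int), 1, fp);
		fwrite(r->name, 1, len, fp);	/* the pointed-to string */
	}

	int read_rec(FILE *fp, struct rec *r)
	{
		int len;

		if (fread((char *)r, sizeof(struct rec), 1, fp) != 1)
			return 0;
		fread((char *)&len, sizeof(int), 1, fp);
		r->name = (char *)malloc(len);	/* replace the garbage pointer */
		fread(r->name, 1, len, fp);
		return 1;
	}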

This is all well and good, and has been working just fine, but it
happens that it is on the benighted PC, and today I thought I'd
switch from medium model to large model.  Surprise, surprise,
those 16-bit garbage pointers I wrote out yesterday are now
trying to be read in as 32-bit garbage pointers, offsetting the
rest of the data, and several thousand data files are currently
unreadable...

(You don't need to chide me for using binary files in the first
place, for which I am amply chiding myself already, nor to suggest
ways out of the dilemma; I am already considering any number of
variously unpalatable ones.)

                                            Steve Summit
                                            scs@adam.pika.mit.edu

nather@ut-emx.UUCP (Ed Nather) (04/29/89)

I have used both binary and ascii data files in different versions of the
same basic data acquisition program -- binary back when digital data
cassettes were new and floppy disks held a massive 160KB, and ascii
when things loosened up a bit.  Believe me, ascii is better:

1. I can use Unix (or Unix-like) text tools to scan data files and their
corresponding headers; I used to have to write my own tools.

2. Normal, everyday human beings can read the files without having to use
a special translation program.  Printers can print them for scrutiny.

3. I can often whump up a pipeline to do some special processing job,
based on Unix(-like) tools for most of it, with perhaps a single custom
tool to do the special stuff as the data flow through.

4. Other computers understand ascii and can read the files without having
to write special conversion routines.

5. I have written conversion routines that turn old, compact binary data
files & headers into ascii, matching the current program's output.  Going
the other way would be unthinkable.

Bad things:

1. Files take a bit longer to read in, since conversion from ascii is now
necessary, but it's a small percentage of the total read time.

2. Files are larger.

3. There are no other bad things.

-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin

cik@l.cc.purdue.edu (Herman Rubin) (04/30/89)

In article <12546@ut-emx.UUCP>, nather@ut-emx.UUCP (Ed Nather) writes:
> I have used both binary and ascii data files in different versions of the
> same basic data acquisition program -- binary back when digital data
> cassettes were new and floppy disks held a massive 160KB, and ascii
> when things loosened up a bit.  Believe me, ascii is better:

			......................

> Bad things:
> 
> 1. Files take a bit longer to read in, since conversion from ascii is now
> necessary, but it's a small percentage of the total read time.
> 
> 2. Files are larger.
> 
> 3. There are no other bad things.

There is another bad thing.  We may not have a good ASCII representation for
the data.  One example is a multi-font system.  Another example is floating
point data; there is no standard floating point binary, and conversion to and
from decimal is a source of roundoff errors, which may even be serious.

-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

zougas@me.utoronto.ca ("Athanasios(Tom) Zougas") (04/30/89)

In article <12546@ut-emx.UUCP> nather@ut-emx.UUCP (Ed Nather) writes:
>.... Believe me, ascii is better:
>    ...
>Bad things [WRT ascii data file]:
>
>1. Files take a bit longer to read in
               ^^^^^
Have you ever tried reading/writing 1,000,000 real numbers using
ascii instead of binary? I changed an engineering analysis program
from being intolerably SSSLLLOOOWWW to exceedingly fast by using
binary data files instead of ascii. My boss was happy :-)

Tom.

p.s. This was on a microcomputer, name withheld to protect my
innocence.

-- 
This is my signature:

	tom zougas

bill@twwells.uucp (T. William Wells) (05/01/89)

In article <89Apr30.140219edt.18480@me.utoronto.ca> zougas@hammer.me.UUCP (Athanasios(Tom) Zougas) writes:
: >1. Files take a bit longer to read in
:                ^^^^^
: Have you ever tried reading/writing 1,000,000 real numbers using
: ascii instead of binary? I changed an engineering analysis program
: from being intolerably SSSLLLOOOWWW to exceedingly fast by using
: binary data files instead of ascii. My boss was happy :-)

Bet you were using scanf! It isn't so bad if you do it yourself, but
scanf and friends are SSSLLLOOOWWW.
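
For instance, a hand-rolled reader that skips all of scanf's format
interpretation (a sketch: unsigned decimal only, no overflow or error
handling):

	#include <stdio.h>
	#include <ctype.h>

	long readlong(FILE *fp)
	{
		long n = 0;
		int c;

		while ((c = getc(fp)) != EOF && !isdigit(c))
			;			/* skip to the next number */
		while (c != EOF && isdigit(c)) {
			n = n * 10 + (c - '0');
			c = getc(fp);
		}
		return n;
	}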

---
Bill                            { uunet | novavax } !twwells!bill

poser@csli.Stanford.EDU (Bill Poser) (05/01/89)

I agree that in many cases it is desirable to use ASCII data files,
but in some situations binary is better. One such situation is when
you need to know how many items are in the file before you read it
(say to allocate storage). If the data is binary you just
stat the file and divide by the item size. But if you use ASCII data
there won't be a fixed item size, unless you go to the trouble of
arranging fixed field widths (which also may waste a lot of space),
or arrange to write a header containing the number of items,
which is impractical in some situations.
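
Something like this (a sketch; "filename" and "struct item" are
stand-ins):

	#include <sys/types.h>
	#include <sys/stat.h>

	struct stat st;
	long nitems;

	if (stat(filename, &st) == 0)
		nitems = st.st_size / sizeof(struct item);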


mcdonald@uxe.cso.uiuc.edu (05/01/89)

>.... Believe me, ascii is better:
>    ...
>Bad things [WRT ascii data file]:
>
>1. Files take a bit longer to read in

More than a bit longer.  I see no reason not to standardize on a
common binary format and use that for data interchange:
8-bit bytes, two's complement.  On machines with bytes longer than
eight bits, pad with zeroes.  Your choice of endianness.  For example,
TeX produces a well-defined .dvi file as output, and Metafont
produces standard .gf files, all of which are binary.  I have
no difficulty copying these between DEC-20s, IBM PCs and VAXes
(which are the same for this), and big-endian RISC machines.
DEC-20s have 36-bit words.

Screwing around with endianness, repacking bytes, and so on is
a lot faster than ASCIIfication, especially in C.
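
The repacking is only a few lines either way.  A sketch for 32-bit
quantities, big-endian in the file regardless of host order (assumes
<stdio.h>; EOF handling omitted):

	void put32(FILE *fp, unsigned long v)
	{
		putc((int)((v >> 24) & 0xff), fp);
		putc((int)((v >> 16) & 0xff), fp);
		putc((int)((v >>  8) & 0xff), fp);
		putc((int)( v        & 0xff), fp);
	}

	unsigned long get32(FILE *fp)
	{
		unsigned long v = 0;
		int i;

		for (i = 0; i < 4; i++)
			v = (v << 8) | (getc(fp) & 0xff);
		return v;
	}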

Doug McDonald

henry@utzoo.uucp (Henry Spencer) (05/02/89)

In article <1271@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>There is another bad thing.  We may not have a good ASCII representation for
>the data... example is floating
>point data; there is no standard floating point binary, and conversion to and
>from decimal is a source of roundoff errors, which may even be serious.

Binary and decimal do not cover the entire space of possibilities.  The
problem is not in conversion to a text form, it's in base conversion.  If
you are willing to assume that floating point is binary, it's conceivable
to convert floating point to octal or hex rather than decimal.

Also, if you have well-behaved (IEEE) floating point conversions, and allow
enough digits, you may not have roundoff errors.
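
A quick demonstration, assuming IEEE doubles and correctly-rounded
conversions (17 significant digits suffice for a double):

	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		double x = 1.0 / 3.0, y;
		char buf[64];

		sprintf(buf, "%.17g", x);	/* out to text... */
		y = strtod(buf, (char **)0);	/* ...and back in */
		printf("%s %s\n", buf, x == y ? "round-trips" : "rounds off");
		return 0;
	}
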
-- 
Mars in 1980s:  USSR, 2 tries, |     Henry Spencer at U of Toronto Zoology
2 failures; USA, 0 tries.      | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

scs@adam.pika.mit.edu (Steve Summit) (05/02/89)

In article <1271@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>There is another bad thing.  We may not have a good ASCII representation for
>the data.  One example is a multi-font system.  Another example is floating
>point data; there is no standard floating point binary, and conversion to and
>from decimal is a source of roundoff errors, which may even be serious.

Or it may be innocuous.  The data I typically manipulate is
derived from a medium-quality A/D converter with 4 or 5 digits of
precision; I'd be misleading myself if I were to take steps to
ensure that my data files could handle more.

I'm not sure what the comment about non-ASCII/multi font systems
means.  Is the implication that binary formats (bit-for-bit
copies of the machine's internal floating-point format) are
somehow less susceptible to such portability concerns?

                                            Steve Summit
                                            scs@adam.pika.mit.edu

scs@adam.pika.mit.edu (Steve Summit) (05/02/89)

In article <8758@csli.Stanford.EDU> poser@csli.stanford.edu (Bill Poser) writes:
>I agree that in many cases it is desirable to use ASCII data files,
>but in some situations binary is better. One such situation is when
>you need to know how many items are in the file before you read it
>(say to allocate storage). If the data is binary you just
>stat the file and divide by the item size.

Actually, this illustrates another thing it's worth shying away
from if you can.  The assumption that you can determine, without
actually reading them, exactly how many characters a file
contains, can get you into trouble, although of course it's a
perfectly valid assumption on Unix systems.  Not so on VMS and
MS-DOS and doubtless other lesser systems -- stat() or the
equivalent may only give you an approximation.

A prime example is Unix tar format: a tar file consists of a file
header, followed by a file, followed by a file header, etc.  The
file header contains the (following) file's size; the size must
be exact because the program reading the tar file must use it to
determine where the file ends and the next header begins.  It's
trivial to write the header correctly on Unix: just stat the
file.  If you're trying to create tar files on other systems (a
reasonable thing to do, since tar is an interchange format) you
typically have to read each file twice: once to count the
characters in it, and a second time to copy it to the tar output
file.

The moral is that if you're writing a program that might be
ported to a non-Unix system, don't depend on the ability to find
a file's size, "in advance," without explicitly reading it.

Getting back to data files, it's not necessary to know how big
they are while reading them.  Just use code like the following:

	int nels = 0;
	int nallocated = 0;
	struct whatever *p = NULL;

	for(;;) {
		if(nels >= nallocated) {
			nallocated += 10;
			if(p == NULL)
				p = (struct whatever *)malloc(
					nallocated * sizeof(struct whatever));
			else	p = (struct whatever *)realloc((char *)p,
					nallocated * sizeof(struct whatever));

			if(p == NULL) {
				fprintf(stderr, "out of memory\n");
				exit(1);
			}
		}

		/* read the next item straight into the grown array;
		   fp is the open data file */
		if(fread((char *)&p[nels], sizeof(struct whatever), 1, fp) != 1)
			break;		/* EOF (or error) */

		nels++;
	}

If realloc can handle a NULL first argument, you can dispense
with the initial test and call to malloc, and always call realloc
(which is why I'm always ranting in favor of this realloc
functionality, which ANSI C incidentally requires).

The on-the-fly reallocation may look inefficient, but "it doesn't
matter much in practice."  (At least for me.  When I'm really
unconcerned with efficiency, I even skip the nallocated += 10
chunking jazz and call realloc for each item read, and that has
never caused problems either.  Your mileage may vary.)

                                            Steve Summit
                                            scs@adam.pika.mit.edu

sjs@spectral.ctt.bellcore.com (Stan Switzer) (05/02/89)

Another alternative definitely worth looking into is Sun's XDR.  XDR
is available via anonymous FTP from titan.rice.edu (the sun-spots
archives) in "sun-source/rpcsrc.*.*".  If you have NFS you already
have XDR.

XDR is an efficient and rather clever scheme for producing flat,
binary, machine-independent representations of arbitrary C data
structures.  It easily allows for pointer chasing, memory allocation,
and unions.
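
A taste of the stdio flavor of the interface (a sketch; the encode
side is shown, and decoding uses XDR_DECODE with the same xdr_*
calls):

	#include <stdio.h>
	#include <rpc/rpc.h>

	int write_sample(FILE *fp, int n, double x)
	{
		XDR xdrs;
		int ok;

		xdrstdio_create(&xdrs, fp, XDR_ENCODE);
		ok = xdr_int(&xdrs, &n) && xdr_double(&xdrs, &x);
		xdr_destroy(&xdrs);
		return ok;
	}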

Check it out.

Stan Switzer  sjs@ctt.bellcore.com

john@frog.UUCP (John Woods) (05/03/89)

In article <893@twwells.uucp>, bill@twwells.uucp (T. William Wells) writes:
> In article <89Apr30.140219edt.18480@me.utoronto.ca> zougas@hammer.me.UUCP (Athanasios(Tom) Zougas) writes:
> : >1. Files take a bit longer to read in
> :                ^^^^^
> : Have you ever tried reading/writing 1,000,000 real numbers using
> : ascii instead of binary? I changed an engineering analysis program
> : from being intolerably SSSLLLOOOWWW to exceedingly fast by using
> : binary data files instead of ascii. My boss was happy :-)
> Bet you were using scanf! It isn't so bad if you do it yourself, but
> scanf and friends are SSSLLLOOOWWW.

Furthermore, you could always use an ASCII representation like
	+.8000000000000+2
to represent 1/2 * 2^2 (or the value 2).  If you are going between machines
with "similar" floating point representations (binary fraction, binary
exponent), then the details of the conversion are fairly simple (especially
if you were thoughtful enough to use a header which specified the format
of the original internal representation).  If your target is something
really peculiar (a decimal computer, say) then it's no worse than converting
"2.000000000E+000", anyway.  A computer with a base-16 floating point format
could handle the binary format with shifts.
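
To show how cheap the conversion is on a binary machine, here is a
sketch that emits roughly the format above, using frexp from <math.h>
(positive normalized values only; sign and zero handling omitted):

	#include <stdio.h>
	#include <math.h>

	void puthex(double x)
	{
		int exp, i, digit;
		double frac = frexp(x, &exp);	/* x == frac * 2^exp */

		printf("+.");
		for (i = 0; i < 13; i++) {	/* 13 hex digits = 52 bits */
			frac *= 16.0;
			digit = (int)frac;
			frac -= digit;
			printf("%x", digit);
		}
		printf("%+d\n", exp);		/* puthex(2.0) prints
						   +.8000000000000+2 */
	}
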
-- 
John Woods, Charles River Data Systems, Framingham MA, (508) 626-1101
...!decvax!frog!john, john@frog.UUCP, ...!mit-eddie!jfw, jfw@eddie.mit.edu

dwilbert@bbn.com (Deborah Wilbert) (05/03/89)

I worked as part of a team building a (very) large, multiprocess, distributed
application and was very good about isolating the user interface, and even 
within the user interface, isolating strings used from the rest of the 
presentation code... except for the error log(*). Strings which eventually 
wended their way into the error log originated in any part of the system.

Well, along comes someone who wants to translate the system into Japanese.
"A piece of cake!" was my first reaction (we had all the display tools) and
indeed, translating the user interface WAS a piece of cake... except for
the error log.

In fact, they wanted to be able to run multiple user interfaces to
a running system, some in Japanese, others in English, simultaneously. No 
problem. They also wanted to view the log in either English or Japanese, upon
demand. Big problem. Virtually every component had to be changed to report
errors in a binary format rather than a string format and the error log 
itself had to be maintained in binary format to achieve this functionality.
FYI, the binary format involved an error code, and various parameters
associated with the errors.
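
In rough outline (hypothetical declarations, much simplified), a log
record looked less like a string and more like:

	struct logrec {
		int code;	/* which error occurred */
		long when;	/* timestamp */
		int nparams;
		long params[4];	/* error-specific values */
	};

	/* The viewer renders on demand: msgtab[lang][code] is a printf
	   format, so the same record displays in English or Japanese:
	   printf(msgtab[lang][rec.code], rec.params[0], ...); */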

Of course, the reimplementation in most of the modules should have been
unnecessary... we should have been reporting internally in binary
format all along (mea culpa), even if we maintained the original log file
in English.
However, (1) a binary output file format would have enforced internal binary
interfaces thus preventing the sloppy coding and (2) the binary format of
the log file was eventually necessary for multilingual access.

If you work on any serious product, you should consider keeping log-style
files in a binary format. When I started work on my project, I had no idea it
would eventually be translated into another language.



							-Deborah 


(*) well, it wasn't really an error log... it was more integral and important
    than an error log, but let's call it an error log.

tneff@bfmny0.UUCP (Tom Neff) (05/03/89)

In article <11021@bloom-beacon.MIT.EDU> scs@adam.pika.mit.edu (Steve Summit) writes:
>Actually, this illustrates another thing it's worth shying away
>from if you can.  The assumption that you can determine, without
>actually reading them, exactly how many characters a file
>contains, can get you into trouble, although of course it's a
>perfectly valid assumption on Unix systems.  Not so on VMS and
>MS-DOS and doubtless other lesser systems -- stat() or the
 ^^^^^^
>equivalent may only give you an approximation.

MS-DOS has exact filesizes in bytes, and a standard OS call to retrieve
a file's size in bytes.  The poster may have been thinking of CP/M,
but CP/M is not MS-DOS.



-- 
Tom Neff				UUCP:     ...!uunet!bfmny0!tneff
    "Truisms aren't everything."	Internet: tneff@bfmny0.UU.NET

bet@dukeac.UUCP (Bennett Todd) (05/04/89)

In article <11021@bloom-beacon.MIT.EDU> scs@adam.pika.mit.edu (Steve Summit)
writes that assuming you can stat a file for its size breaks down on non-UNIX
systems, and recommends reading into a dynamically grown buffer, which he
grows linearly.

I have often done similar things. The getline() routine in my libbent does a
conceptually similar job for reading arbitrarily long text lines (where you
don't know in advance how many bytes to allocate to last you 'till the next
newline on input). Also, in an image I/O and manipulation library I wrote, I
wanted to be able to read an image from a pipe. I disbelieve in header
parsing, and deduce image dimensions from the file length, so I had to do
roughly the same thing.

However, I am not sure I like the linear reallocation strategy. I would tend
to assume, in general, that realloc would usually be implemented as a series
of malloc/memcpy/free, and thus I try to avoid working it too hard. I found a
binary growth algorithm easy to code, however; basically it looks just like
Steve's linear algorithm except instead of 

	nallocated += 10;

I use 

	nallocated *= 2;

Also, I start with somewhat larger allocations; for getline() I started with
128, and with the image reading facility I start with 65536. Finally, where
you are reading until EOF, by all means issue one big read per
realloc, rather than reading along by one or two bytes at a time.
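
Putting the pieces together, a sketch of the read loop (error
recovery mostly omitted):

	#include <stdio.h>
	#include <stdlib.h>

	char *slurp(FILE *fp, size_t *lenp)	/* read fp to EOF */
	{
		size_t cap = 65536, len = 0, got;
		char *buf = malloc(cap);

		while (buf != NULL &&
		    (got = fread(buf + len, 1, cap - len, fp)) > 0) {
			len += got;
			if (len == cap) {	/* full: double it */
				cap *= 2;
				buf = realloc(buf, cap);
			}
		}
		*lenp = len;
		return buf;	/* NULL if an allocation failed */
	}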

I actually haven't done any performance measurements to determine whether I am
buying any speed with this strategy; however, it isn't much harder to code,
and I am sure on some (if not most) machines the vendor doesn't take enough
care to optimize performance of realloc().

-Bennett
bet@orion.mc.duke.edu

poser@csli.Stanford.EDU (Bill Poser) (05/04/89)

It is true that some operating systems don't allow you to determine the
true file size, but essentially any program that does file i/o,
graphics, or other dealings with the outside world has to make some
assumptions about the OS. After all, some OSs don't provide for
shared memory or execution of child processes or our favorite
graphics system or window system. And some really grody
operating systems have zillions of file types, which generally do not
have exact counterparts in other OSs, so just using text
files (I don't write "ASCII" since some of the nasty OSs in question use
EBCDIC) doesn't make the program portable. For many of us (admittedly
not all) the class of UN*X systems is a large enough target for
portability. And if, as I do, you are dealing with files containing
millions of numbers, the efficiency of using binary data can be a big
win.

mcdonald@uxe.cso.uiuc.edu (05/04/89)

>MS-DOS has exact filesizes in bytes, and a standard OS call to retrieve
>a file's size in bytes.  The poster may have been thinking of CP/M,
>but CP/M is not MS-DOS.


Yes, but that size is the same as the number of characters returned
by fgetc only if the file is binary. Not generally so if it is text.

Doug McDonald

geoff@cs.warwick.ac.uk (Geoff Rimmer) (05/05/89)

Let me throw in my 2c worth.

My "rules" for choosing whether I use a binary file, or an ASCII file
for storing data are as follows:

(1)	If I am storing a file of structs, I always use binary.  This
is because it is faster, and because it makes the code easier to write
and understand.

For example, to delete a struct whose "ref" is set to 999, and write
the new file of structs out elsewhere:

	struct blurfl buf;

	/* fpr, fpw: opened with fopen(..., "rb") and fopen(..., "wb") */
	while ( fread ( (char*) &buf, sizeof(struct blurfl), 1, fpr) == 1)
	{
		if (buf.ref != 999)
			fwrite( (char*) &buf, sizeof(struct blurfl), 1, fpw);
	}

(2)	If I don't know how many fields will be read in at a time,
i.e. one of the fields might determine how many more fields are to be
read in, I use ASCII.

With (2), I often use strtok(), which I've found very useful:

	char *ptr, str[BUFSIZ];

	while (fgets(str,BUFSIZ,fpr))	/* fgets wants the full size */
	{
		if (!(ptr=strtok(str," \t\n")))	/* \n so tokens lose the newline */
			continue;
		do
		{
			printf("%s\n",ptr);
		} while ((ptr=strtok((char*)0," \t\n")) != NULL);
	}

(In reality, I wouldn't use fgets, since it requires a maximum length
to read, which could result in lost data if there is a particularly
long line.)

I don't believe you can say "always use ascii files" - it just ain't
good enough for some applications.

Geoff

nather@ut-emx.UUCP (Ed Nather) (05/05/89)

In article <225800167@uxe.cso.uiuc.edu>, mcdonald@uxe.cso.uiuc.edu writes:
> 
> 
> >MS-DOS has exact filesizes in bytes, and a standard OS call to retrieve
> >a file's size in bytes.  The poster may have been thinking of CP/M,
> >but CP/M is not MS-DOS.
> 
> 
> Yes, but that size is the same as the number of characters returned
> by fgetc only if the file is binary. Not generally so if it is text.

Sorry, that's not correct.  The exact size of the file as it exists on the
disk is indeed returned.  If, however, you read a file into memory in text
mode, its size will CHANGE due to removal of '\r' codes by the input routine.
You can avoid this behavior by opening the text file in binary mode in the
'fopen' call.  There are also library calls to switch an output stream to
binary, if you wish, so the '\r' codes are not inserted on output.
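
The difference is easy to demonstrate (a sketch; on MS-DOS the "r"
count comes up short of the on-disk size by one character per line):

	#include <stdio.h>

	long count_chars(const char *name, const char *mode)
	{
		FILE *fp = fopen(name, mode);	/* "r" or "rb" */
		long n = 0;

		if (fp == NULL)
			return -1L;
		while (fgetc(fp) != EOF)
			n++;
		fclose(fp);
		return n;
	}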

This whole mess came about because the author(s) of MS-DOS refused to accept
the Unix convention of a single '\n' newline character instead of '\r''\n'.
CPM still lives; its genes are hiding inside MS-DOS.

-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin

les@chinet.chi.il.us (Leslie Mikesell) (05/07/89)

In article <1821@ubu.warwick.UUCP> geoff@cs.warwick.ac.uk (Geoff Rimmer) writes:

>My "rules" for choosing whether I use a binary file, or an ASCII file
>for storing data are as follows:

>(1)	If I am storing a file of structs, I always use binary.  This
>is because it is faster, and because it makes the code easier to write
>and understand.

[...]
>I don't believe you can say "always use ascii files" - it just ain't
>good enough for some applications.

I'm surprised that no one has mentioned this yet, but writing binary data
or structs to disk files results in a file which may require conversion
for use by another machine.  This becomes a serious problem in a networked
environment where it may not be apparent which machine created the file
or when multiple machines need simultaneous access.  It is not that
unlikely that your next machine upgrade will consist of adding more hosts
on a net with access to the current files.  Do you want to bet that they
will have the same byte order, word size, and struct padding?
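
Even without a network, the point is easy to demonstrate (a sketch):

	#include <stdio.h>

	struct sample {
		char tag;
		long value;	/* padding before this varies by compiler */
	};

	int main(void)
	{
		/* 5 on some machines, 8 on others -- and that is
		   before byte order or the size of long come into it */
		printf("sizeof = %lu\n",
		    (unsigned long)sizeof(struct sample));
		return 0;
	}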

Les Mikesell

bright@Data-IO.COM (Walter Bright) (05/09/89)

In article <12815@ut-emx.UUCP> nather@ut-emx.UUCP (Ed Nather) writes:
<This whole mess came about because the author(s) of MS-DOS refused to accept
<the Unix convention of a single '\n' newline character instead of '\r''\n'.
<CPM still lives; its genes are hiding inside MS-DOS.

And CPM is based on DEC's RT-11 for PDP-11 computers. At the time, DEC
operating systems were very popular, and they all used the \r and \n
convention. Unix was not nearly so prevalent then. So you cannot fault
CPM for not following the unix conventions. MS-DOS got a head start by
making it easy to port CPM programs over to it. At the time it had
no C compiler or hard disk, so there was no urge to port unix code to it.
All that existed was CPM assembly programs.

All in all, most of the decisions were rational and made sense at the time.
It's rather easy to criticize from 10 years later.

MS-DOS moved into the unix camp and away from DEC with version 2.0. The
big screwup here was using a \ as a separator, rather than /. I suspect
that the reason was that Microsoft used the / as a switch character in
all their application programs.

peter@ficc.uu.net (Peter da Silva) (05/09/89)

In article <1970@dataio.Data-IO.COM>, bright@Data-IO.COM (Walter Bright) writes:
> And CPM is based on DEC's RT-11 for PDP-11 computers. At the time, DEC
> operating systems were very popular, and they all used the \r and \n
> convention.

That's funny... most DEC systems I know (including RSX, which is what CP/M
seems most closely modelled on) store files as a series of variable length
records containing (usually) a 2 or 4 byte header containing the length and
maybe the line number and then the data on the line.

CP/M was actually based directly on an obscure Intel DOS called ISIS, with
some terminology (PIP, etc) borrowed from DEC.

Isn't this getting a bit far from 'C'?
-- 
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.

Business: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180.
Personal: ...!texbell!sugar!peter, peter@sugar.hackercorp.com.