[comp.unix.sysv386] Converting DOS text files

erc@pai.UUCP (Eric Johnson) (10/15/90)

This is for those of you who have SCO's OpenDesktop with a DOS
under UNIX, or any other DOS under UNIX that has this problem.

The problem is this: when you use a DOS-based copy command to copy a text
file onto your system (from a PC floppy, say), that DOS text file
is full of CR/LFs (instead of the UNIX line feed) and has a trailing
Ctrl-Z. On SCO, there is a program to take care of this, called
dtox. Unfortunately, dtox is a filter. That is, you call it
with something like:

     dtox dosfile > unixfile

This is nice, but I have a big problem. I have 30 to 40 files I
want to un-DOS at a time. I want to be able to type something
like 

    undos *.txt

And have a program go to work stripping all the extra DOS characters
out of the files. In addition, dtox didn't seem to deal with the
trailing Ctrl-Z properly. So, here is my (hacked) solution, undos.c

undos takes all the files on its command line and converts the
format to a UNIX text format (from a DOS text format). It's simple,
dumb, and I'm sure you can come up with a better, more efficient
method. Oh well, it works.

Please note that this is NOT in the public domain. It is copyrighted 
in my name, but I used essentially the very liberal terms of the X Window
copyrights. (More liberal than the GNU public license.) 

You can freely distribute this program so long as you keep my
copyright message intact. There is absolutely no warranty of
any kind with this software--you are on your own.

You should be able to compile undos with a UNIX command like:

   cc -o undos undos.c

I'm posting this in hope it helps save time for others out there.
If it doesn't save you any time, it's not worth your bother.
-Eric

----------------cut here for undos.c---------------------------------------
/*
 * undos.c
 *
 * Copyright 1990 Eric F. Johnson
 *
 * Permission to use, copy, modify, distribute, and sell this software and its
 * documentation for any purpose is hereby granted without fee, provided that
 * the above copyright notice appear in all copies and that both that
 * copyright notice and this permission notice appear in supporting
 * documentation, and that the name of E F Johnson not be used in advertising 
 * or publicity pertaining to distribution of the software without specific,
 * written prior permission. I (Eric Johnson) make no representations about the
 * suitability of this software for any purpose.  It is provided "as is"
 * without express or implied warranty.
 *
 * I DISCLAIM ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL I
 * BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
 * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION
 * OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
 * CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
 *
 * Author:  Eric F. Johnson
 *
 * undos.c
 *
 * Program to strip carriage returns and control-Z's
 * from a DOS-based text file. This program acts
 * like the SCO program dtox, but it acts on the
 * file in place, as well as strips the trailing
 * control-Z from the DOS file. By saying this program
 * acts on a text file in place, I mean that it will
 * overwrite the source file.
 *
 * Usage is:
 *      undos file1 file2 file3 ...
 *
 *      Where each file is an ASCII text file in DOS format.
 *
 * 12 October 90
 *
 */


#include  <stdio.h>


main( argc, argv )

int     argc;
char    *argv[];

{       /* main */
        int     i;
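        char    *temp_file, *mktemp();
        static char template[] = "dosXXXXXX";

        /*
         * Get a temporary file name
         * to use for storing the
         * un-DOS-ed file until
         * we're done. mktemp() overwrites
         * its argument, so pass a writable
         * array ending in six X's rather
         * than a string literal.
         */
        temp_file = mktemp( template );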



        if ( argc < 2 )
                {
                fprintf( stderr,
                        "Error: Usage is undos dosfile1 dosfile2...\n" );
                exit( 1 );
                }

        for( i = 1; i < argc; i++ )
                {
                printf( "Converting %s to a UNIX text file.\n",
                        argv[i] );

                /*
                 * Remove CR/LFs and Ctrl-Zs
                 */
                undos_file( argv[i], temp_file );


                /*
                 * Delete temp_file when done
                 */
                unlink( temp_file );
                }


        exit( 0 );
        
}       /* main */


undos_file( source_file, temp_file )

char    source_file[];
char    temp_file[];

{       /* undos_file */
        FILE    *in_file, *outfile;
        int     c;

        in_file = fopen( source_file, "r" );
        outfile = fopen( temp_file, "w" );

        if ( ( in_file == (FILE *) NULL ) || ( outfile == (FILE *) NULL ) )
                {
                fprintf( stderr, "Error in opening files %s or %s\n",
                        source_file, temp_file );
                return( -1 );
                }


        while( !feof( in_file ) )
                {
                c = fgetc( in_file );

                if ( !feof( in_file ) )
                        {
                        if ( ( c == 26 ) ||     /* Ctrl-Z */
                                ( c > '~' ) ) 
                                {
                                c = '\n';
                                }

                        if ( c != '\r' ) 
                                {
                                fputc( c, outfile );
                                }
                        }
                }

        fclose( in_file );
        fclose( outfile );

        /*
         * Now, copy the file back
         */
        in_file = fopen( temp_file, "r" );
        outfile = fopen( source_file, "w" );

        if ( ( in_file == (FILE *) NULL ) || ( outfile == (FILE *) NULL ) )
                {
                fprintf( stderr, "Error in opening files %s or %s\n",
                        source_file, temp_file );
                unlink( temp_file );
                return( -1 );
                }


        while( !feof( in_file ) )
                {
                c = fgetc( in_file );

                if ( !feof( in_file ) )
                        {
                        fputc( c, outfile );
                        }
                }

        fclose( in_file );
        fclose( outfile );
                
        return( 0 );

}       /* undos_file */

/*
 *      end of file
 */

----------------cut here ---------------------------------------

-- 
Eric F. Johnson               phone: +1 612 894 0313    BTI: Industrial
Boulware Technologies, Inc.   fax:   +1 612 894 0316    automation systems
415 W. Travelers Trail        email: erc@pai.mn.org     and services
Burnsville, MN 55337 USA

tneff@bfmny0.BFM.COM (Tom Neff) (10/16/90)

In article <1477@pai.UUCP> erc@pai.UUCP (Eric Johnson) writes:
>dtox. Unfortunately, dtox is a filter. That is, you call it
>with something like:
>
>     dtox dosfile > unixfile
>
>This is nice, but I have a big problem. I have 30 to 40 files I
>want to un-DOS at a time. 

The solution is to learn how to use the shell.

	for f in *.txt
	do
		g=`echo $f | sed -e 's/txt$/out/'`   # sample.txt -> sample.out
		dtox < $f > $g
	done

I bet even SCO supports this construct. ;-)

-- 
War is like love; it always      \%\%\%   Tom Neff
finds a way. -- Bertold Brecht   %\%\%\   tneff@bfmny0.BFM.COM

johnl@esegue.segue.boston.ma.us (John R. Levine) (10/16/90)

In article <1477@pai.UUCP> erc@pai.UUCP (Eric Johnson) writes:
>The problem is this: when you use a DOS-based copy command to copy a text
>file onto your system (from a PC floppy, say), that DOS text file
>is full of CR/LFs (instead of the UNIX line feed) and has a trailing
>Ctrl-Z.  [172 line program follows]

Here's a six-line shell script that does the same thing.  I call it uncr.

#!/bin/sh
# Get rid of carriage returns in files
# Dedicated to the public domain, do anything with it you want. -jrl

for i
do
	echo $i:
	qfile=`dirname $i`/QQ`basename $i`
	mv $i $qfile && tr -d \\015\\032 <$qfile >$i && rm $qfile
done

-- 
John R. Levine, IECC, POB 349, Cambridge MA 02238, +1 617 864 9650
johnl@esegue.segue.boston.ma.us, {ima|spdcc|world}!esegue!johnl
Atlantic City gamblers lose $8200 per minute. -NY Times

jgd@rsiatl.UUCP (John G. DeArmond) (10/16/90)

erc@pai.UUCP (Eric Johnson) writes:

>This is for those of you who have SCO's OpenDesktop with a DOS
>under UNIX, or any other DOS under UNIX that has this problem.

>The problem is this: when you use a DOS-based copy command to copy a text
>file onto your system (from a PC floppy, say), that DOS text file
>is full of CR/LFs (instead of the UNIX line feed) and has a trailing
>Ctrl-Z. On SCO, there is a program to take care of this, called
>dtox. Unfortunately, dtox is a filter. That is, you call it
>with something like:

[program with BIG copyright deleted.]

Please don't take this wrong but your approach, while probably necessary
in a DOS tool-less environment, is terrible for Unix.  Here's how you do
it without any programming.  Get to know Mr. Shell.  He is your friend. 
Here's how:  

for i in `ls *.txt`
do
	# takes care of read-only temp file name collisions
	rm -f /tmp/$i	>/dev/null 2>&1
	tr -d '\032''\015' <$i >/tmp/$i
	if [ $? -eq 0 -a -f /tmp/$i ]
	then
		mv -f /tmp/$i $i
	else
		rm -f /tmp/$i >/dev/null 2>&1	# just in case
		echo "tr returned an error on file $i"
		exit
	fi
done

If you want to put this in a shell script, simply substitute this for the 
first line:

for i in `ls $*`

What this script does is first execute the command in back-ticks ("ls *.txt")
and then steps through the list of files via the shell variable "i".
Each file is run through tr (translate) invoked in its delete mode (-d).
Tr is told to delete ^M (octal 015) and ^Z (octal 032).  The return code
from tr is stored in the shell intrinsic "$?".  If tr is successful,
this value will be 0.  The "if" statement checks to see if tr ran ok AND
if the temporary file was created ok and if so moves the temporary file
back on top of the original.  There are even simpler ways to do this,
but this is what popped out of my head when reading your post.  There are
several unaddressed error conditions in this script, such as when a temp
file name collision occurs and the temp file is not owned by you, but 
these problems are left as an exercise to the reader :-)

You could, of course, use dtox in place of tr but this solution is unix 
vendor-independent.  You could also use sed, awk, Perl (if installed) and
who knows what else.  In other words, get with the Unix tools show, man :-)
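
For readers who think in C rather than shell, the tr -d '\015\032' step
described above can be sketched as a small in-place routine.  This is only
an illustration: strip_dos_chars is a hypothetical name that does not appear
anywhere in this thread, and the code is ANSI C rather than the K&R style of
undos.c.

```c
#include <assert.h>
#include <stddef.h>

/*
 * strip_dos_chars (hypothetical helper, not from the thread):
 * delete every CR (octal 015) and Ctrl-Z (octal 032) from the first
 * len bytes of buf, in place, much as tr -d '\015\032' would.
 * Returns the number of bytes that remain.
 */
size_t strip_dos_chars(char *buf, size_t len)
{
    size_t in, out = 0;

    assert(buf != NULL);
    for (in = 0; in < len; in++) {
        if (buf[in] != '\015' && buf[in] != '\032')
            buf[out++] = buf[in];   /* keep everything else */
    }
    return out;
}
```

A caller would read a chunk of the file into buf, call the routine, and
write out only the returned number of bytes.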

Minor programming note.  I don't usually critique coding practices on the
net but in this case I gotta.  Your approach is terribly inefficient,
requiring twice as much system resource as necessary.  Namely, you first
process the input file a character at a time (which is OK for a quick
hack) and then you copy the temp file back onto the input file a
character at a time (NO NO).  The easiest way to move the temp file back
onto the original is to use a system() call with mv.  Example: 

sprintf(tmpstr,"mv %s %s", tmpname, filename);
system(tmpstr);

For a bit of error checking, you could fork() and exec() mv and look at the
return code from wait().  Or, assuming the files are both on the same file
system, you could simply unlink() the old file, link() the old name to the
temp file and unlink() the temp file.  That is the most efficient way of doing it.
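
The fork()/exec()/wait() suggestion might look roughly like the sketch
below.  run_mv is a hypothetical helper name; the sketch assumes an mv
binary can be found on PATH, and it uses POSIX waitpid() in place of plain
wait() so that only this one child is reaped.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * run_mv (hypothetical helper): run "mv from to" in a child process.
 * Returns 0 if mv exited with status 0, -1 otherwise.
 */
int run_mv(const char *from, const char *to)
{
    pid_t pid;
    int status;

    assert(from != NULL && to != NULL);

    pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        /* child: becomes mv; only returns here if the exec failed */
        execlp("mv", "mv", from, to, (char *)0);
        _exit(127);
    }

    /* parent: wait for that child and inspect how it finished */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Unlike system(), this does not pass the file names through a shell, so
names containing spaces or shell metacharacters reach mv unchanged.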

While one could (successfully) argue that a system() or fork() system
call would be more expensive than processing small files a byte at a time,
for typical files, this would not be the case.  And for machines that
process I/O system calls slowly (NCR towers come to mind), even small
files would seriously degrade performance, especially if you are doing
a lot of them.

John

John De Armond, WD4OQC  | "The truly ignorant in our society are those people 
Radiation Systems, Inc. | who would throw away the parts of the Constitution 
Atlanta, Ga             | they find inconvenient."  -me   Defend the 2nd
{emory,uunet}!rsiatl!jgd| with the same fervor as you do the 1st.

rcd@ico.isc.com (Dick Dunn) (10/17/90)

erc@pai.UUCP (Eric Johnson) writes about converting DOS text files (with
CR-LF line terminators and final ^Z) for UNIX.

> ...On SCO, there is a program to take care of this, called
> dtox...

Much to my surprise, he's right that SCO really did make it a program. 
Back in the days of UNIX, we would have used one of the little filter
programs (like tr or sed) that came with the system.  Oh well, "forward
into the past" and "programmer's full employment" and all that.

>...Unfortunately, dtox is a filter. That is, you call it
> with something like:
> 
>      dtox dosfile > unixfile

Why is that a problem?  A filter is just slightly more general: You can
apply a filter to files; a program written to handle only files can't be
used in a pipe sequence.  But let's forge ahead...

> This is nice, but I have a big problem. I have 30 to 40 files I
> want to un-DOS at a time. I want to be able to type something
> like 
>     undos *.txt

A big problem?  Why not (in sh notation):
	for f in *.txt
	do cp $f /tmp/d$$
	   dtox /tmp/d$$ >$f
	done
	rm /tmp/d$$

or go all the way to a UNIX approach and replace the dtox line with
	tr -d '\015\032' </tmp/d$$ >$f
-- 
Dick Dunn     rcd@ico.isc.com -or- ico!rcd       Boulder, CO   (303)449-2870
   ...Never offend with style when you can offend with substance.

rpeglar@csinc.UUCP (Rob Peglar) (10/17/90)

In article <1990Oct16.134008.22319@esegue.segue.boston.ma.us>, johnl@esegue.segue.boston.ma.us (John R. Levine) writes:
> In article <1477@pai.UUCP> erc@pai.UUCP (Eric Johnson) writes:
> >The problem is this: when you use a DOS-based copy command to copy a text
> >file onto your system (from a PC floppy, say), that DOS text file
> >is full of CR/LFs (instead of the UNIX line feed) and has a trailing
> >Ctrl-Z.  [172 line program follows]

I give up.

doscp -m a:file.ext dir/file

Rob
-- 
Rob Peglar	Comtrol Corp.	2675 Patton Rd., St. Paul MN 55113
		A Control Systems Company	(800) 926-6876

...uunet!csinc!rpeglar

ken@metaware.metaware.com (ken) (10/17/90)

In article <1990Oct16.134008.22319@esegue.segue.boston.ma.us> johnl@esegue.segue.boston.ma.us (John R. Levine) writes:
>In article <1477@pai.UUCP> erc@pai.UUCP (Eric Johnson) writes:
>>The problem is this: when you use a DOS-based copy command to copy a text
>>file onto your system (from a PC floppy, say), that DOS text file
>>is full of CR/LFs (instead of the UNIX line feed) and has a trailing
>>Ctrl-Z.  [172 line program follows]



Here's yet another solution. This closely emulates Sun's dos2unix program.

#include <stdio.h>
#include <ctype.h>
#include <errno.h>

main(argc, argv)
int argc;
char *argv[];
{
	int c;
	FILE *ifp=NULL, *ofp=NULL;
	extern void exit();

	if (argc != 3 && argc != 2 && argc != 1)
		printf("\n\tUsage:  %s [infile [outfile]]\n\n", argv[0]);
	else {
		switch (argc) {
		case 1:
			ifp = stdin;
			ofp = stdout;
			break;
		case 2:
			if ((ifp = fopen(argv[1], "r")) == NULL) {
				perror(argv[1]);
				exit(errno);
			}
			ofp = stdout;
			break;
		case 3:
			if ((ifp = fopen(argv[1], "r")) == NULL) {
				perror(argv[1]);
				exit(errno);
			}
			if ((ofp = fopen(argv[2], "w")) == NULL) {
				perror(argv[2]);
				exit(errno);
			}
			break;
		}
		while ((c = getc(ifp)) != EOF) {
			if ((c != '\r') && (c != '\032'))	/* skip CR and Ctrl-Z */
				putc(c, ofp);
		}
		if (ifp != NULL)
			fclose(ifp);
		if (ofp != NULL)
			fclose(ofp);
		exit(0);
	}
	exit(-1);
}

shore@mtxinu.COM (Melinda Shore) (10/19/90)

In article <4339@rsiatl.UUCP> jgd@rsiatl.UUCP (John G. DeArmond) writes:
>While one could (successfully) argue that a system() or fork() system
>call would be more expensive than processing small files a byte at a time,
>for typical files, this would not be the case.  And for machines that
>process I/O system calls slowly (NCR towers come to mind), even small
>files would seriously degrade performance, especially if you are doing
>a lot of them.

I'm not going to get into whether or not the program was great code,
but it's worth pointing out that using stdio *is* a reasonably
efficient general-case approach.  Remember that the library is doing
i/o buffering for you in BUFSIZ chunks, which allows you to do
what looks like single-character processing on top of buffered i/o.
Also, underneath it all, the OS is not going to be doing a disk
read for every read() - that's what the buffer cache is all about.
(It may do a memory/memory copy, but that's another matter.)
Anyway, the point is that you shouldn't be afraid to use stdio
if you're worried about efficiency.
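
As a rough illustration of that point, here is a getc()/putc() loop that
looks like byte-at-a-time processing but leaves the chunking to stdio.
undos_stream is a hypothetical name, and the choice to drop CR and Ctrl-Z
simply mirrors the tr examples earlier in the thread.

```c
#include <assert.h>
#include <stdio.h>

/*
 * undos_stream (hypothetical helper): copy in to out, dropping CR and
 * Ctrl-Z.  Each getc()/putc() normally just touches the stdio buffer;
 * the library refills or flushes it in larger chunks behind the scenes.
 * Returns the number of bytes written, or -1 on an I/O error.
 */
long undos_stream(FILE *in, FILE *out)
{
    int c;
    long written = 0;

    assert(in != NULL && out != NULL);
    while ((c = getc(in)) != EOF) {
        if (c == '\r' || c == '\032')
            continue;               /* skip the DOS-only characters */
        if (putc(c, out) == EOF)
            return -1;
        written++;
    }
    return ferror(in) ? -1 : written;
}
```

Called as undos_stream(stdin, stdout), the same routine also works as a
filter in a pipeline.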

I never use the system() library routine on SCO.  Well, I never use
it anyway (it gets the shell involved and does more than I usually
want done), but it seems to me that it's particularly to be avoided
with SCO because of the way it resets certain signal handlers, in
particular SIGCLD.  You can avoid doing the copy yourself if the
files are on the same filesystem by doing something like
	link(oldfile, newfile);
	unlink(oldfile);
If the files are on different filesystems somebody is going to
have to do the copy, whether it's mv or you do it yourself.  Again,
stdio will handle the buffering for you and doing getc()/putc()
kinds of things isn't inherently inefficient.
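
A minimal sketch of that approach follows: try link()/unlink() first, and
fall back to a stdio copy when the link cannot be made.  move_file and
copy_file are hypothetical names.  The sketch assumes newfile does not
already exist (link() fails with EEXIST if it does, so the caller would
unlink it first), and falling back on every other link() failure, not only
on a cross-filesystem error, is a simplification of mine.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* copy_file (hypothetical): plain getc()/putc() copy; 0 on success */
static int copy_file(const char *from, const char *to)
{
    FILE *in, *out;
    int c, ok;

    if ((in = fopen(from, "r")) == NULL)
        return -1;
    if ((out = fopen(to, "w")) == NULL) {
        fclose(in);
        return -1;
    }
    while ((c = getc(in)) != EOF)
        if (putc(c, out) == EOF)
            break;
    ok = !ferror(in) && !ferror(out);
    fclose(in);
    if (fclose(out) == EOF)         /* the final flush can fail too */
        ok = 0;
    return ok ? 0 : -1;
}

/* move_file (hypothetical): move oldfile to newfile; 0 on success */
int move_file(const char *oldfile, const char *newfile)
{
    assert(oldfile != NULL && newfile != NULL);

    if (link(oldfile, newfile) == 0)
        return unlink(oldfile);     /* same filesystem: no data copied */
    if (errno == EEXIST)
        return -1;                  /* refuse to clobber newfile */
    if (copy_file(oldfile, newfile) != 0)
        return -1;                  /* e.g. different filesystems and copy failed */
    return unlink(oldfile);
}
```

Where it is available, rename() does the same-filesystem case in a single
call and replaces an existing target.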
-- 
Melinda Shore                                 shore@mtxinu.com
mt Xinu                              ..!uunet!mtxinu.com!shore

emanuele@overlf.UUCP (Mark A. Emanuele) (10/23/90)

In article <232@csinc.UUCP>, rpeglar@csinc.UUCP (Rob Peglar) writes:
> 
> I give up.
> 
> doscp -m a:file.ext dir/file



just try doing doscp -m a:*.* dir    and see what happens.
doscp can't expand wildcards on the dos drive.

what you have to do is this
for i in `dosls a:`
do
	doscp -m a:${i} dir
done
-- 
Mark A. Emanuele
V.P. Engineering  Overleaf, Inc.
500 Route 10 Ledgewood, NJ 07852-9639         attmail!overlf!emanuele
(201) 927-3785 Voice   (201) 927-5781 fax     emanuele@overlf.UUCP