[comp.os.minix] New program: gather.c

ast@cs.vu.nl (Andy Tanenbaum) (11/14/89)

I have changed the way I am going to do the update to 1.4b.  My original
plan was to fill a directory full of .cdif files, cat them together into
one huge .cdif file, compress that and uuencode it.    The uuencoded file
is then split into pieces using 'split.'  To update, you collect all the
pieces together, uudecode, decompress and feed the result to patch.  The
problem is that if you miss one posting or run out of disk space, you can't
do anything at all.  

I devised a new scheme.  Now all the .cdif files will be gathered up into
shar files, compressed and uuencoded per shar file, so that each posting
contains some number of whole files.  No files will be split across two
postings, so each one can be processed at the receiving end individually.
To do this, I needed some way of collecting all the loose .cdif files into
shar files of about 60K (after compression and uuencoding).  Technically
this is known as the bin packing problem, and theorists have worked on it
for years.  Being an antitheorist, I figured I could do just as well by
sitting at my terminal for 5 minutes.  Below is a program, gather.c, that
collects all the files in a directory and builds as many uuencoded, compressed
shar archives of a given approximate size as needed to hold all the files.
The bin packing is in heuristic().  If anybody wants to do a real bang up
job improving it, be my guest, but I doubt that it is worth it.

Andy Tanenbaum (ast@cs.vu.nl)

----------------------------- cut here:  gather.c ------------------------
/* gather - collect files for mailing	Author: Andy Tanenbaum */

/* It sometimes happens that one needs to mail a large directory full of
 * files to someone.  This program can be used to collect these files into
 * shar archives, compress and uuencode them. The interesting property that
 * it has is that it makes sure that none of the archives are too big, and
 * that no files are split over two archives.
 *
 * Syntax: gather [-s source_dir] [-d dest_dir] [-b max_arch_size] [-f file]
 *
 *	-s source directory	(where are the files to be sent)
 *	-d destination dir	(where should the archives be put)
 *	-b bytes		(maximum size of the archives; default 60K)
 *	-f file			(use file_00.uue etc as archive names)
 *
 * Examples:
 *	gather				# make 60K archives in this dir
 *	gather  -d mailings -b 50000	# make 50K archives in mailings
 *
 * Note:
 *	The maximum size given by -b (default 60000 bytes) is only an
 *	approximation, since it is hard to tell how big the final file
 *	will be after shar'ing, compressing, and uue'ing.  A heuristic
 *	is used.
 */

#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>

#define DEFAULT        60000	/* default archive size */
#define MAX_DIR_ENT      512	/* how many directory entries allowed */
#define HEAP_SIZE      20000	/* storage size for all file names */
#define NAME_SIZE       4096	/* storage size for current command */
#define BASE_SIZE          7	/* max number of chars in basename */
#define PATH_MAX         512	/* largest path name */
#define NUMERATOR       138L	/* heuristic parameter */
#define DENOMINATOR     100L	/* heuristic parameter */

char heap[HEAP_SIZE + 2];	/* dir entries stored here */
char names[NAME_SIZE];		/* file name lists constructed here */
char work[NAME_SIZE];		/* scratch buffer */
char base_name[BASE_SIZE + 1];	/* base name to use for the archives */
char target[PATH_MAX];		/* storage for target file names */

struct dir_ent {
  char *file_name;
  long file_size;
} dir_ent[MAX_DIR_ENT];

long atol(), heuristic();
char *getcwd();

main(argc, argv)
int argc;
char *argv[];
{
/* Parse the command and get ready. */

  int i, counter, l, s, nonlocal;
  char *p, num[3];
  struct dirent *d;
  DIR *dirp;
  struct stat stbuf;
  int first;			/* first entry not used yet */
  int limit;			/* number of files in src_dir */
  char *src_dir = ".";		/* pointer to source directory */
  char *dst_dir = ".";		/* pointer to destination directory */
  char *file = "";		/* name to use */
  long max_bytes = DEFAULT;	/* max archive size (approx.) */
  long cutoff;			/* max cumulative input size */

  if (argc > 9) usage();
  i = 1;
  while (i < argc) {
	/* Examine the i-th argument. */
	p = argv[i];
	if (*p != '-') usage();
	switch (*(p + 1)) {
	    case 's':	src_dir = argv[i + 1];	break;
	    case 'd':	dst_dir = argv[i + 1];	break;
	    case 'f':	file = argv[i + 1];	break;

	    case 'b':
		max_bytes = atol(argv[i + 1]);
		if (max_bytes <= 0) {
			fprintf(stderr, "gather: bad -b value\n");
			exit(1);
		}
		break;

	    default:
		fprintf(stderr, "gather: unknown flag %s\n", p);
		exit(1);
	}
	i += 2;
  }

  /* Determine the basename. */
  get_basename(src_dir, file);

  /* Open the source directory. */
  i = 0;
  p = heap;
  if ((dirp = opendir(src_dir)) == (DIR *) NULL) {
	fprintf(stderr, "gather: cannot open %s\n", src_dir);
	exit(2);
  }

  /* Read in all the file names. */
  while (1) {
	d = readdir(dirp);
	if (d == (struct dirent *) NULL) break;
	l = strlen(d->d_name);
	if (p + l >= &heap[HEAP_SIZE] || i >= MAX_DIR_ENT) {
		fprintf(stderr, "gather: %s is too large\n", src_dir);
		exit(2);
	}
	strcpy(work, src_dir);
	strcat(work, "/");
	strcat(work, d->d_name);
	stat(work, &stbuf);
	if ((stbuf.st_mode & S_IFMT) == S_IFDIR) continue;
	dir_ent[i].file_name = p;
	strcpy(p, d->d_name);
	dir_ent[i].file_size = stbuf.st_size;
	p += l + 1;
	i++;
  }
  limit = i;
  closedir(dirp);

  /* Sort the names. */
  sort_dir(limit);

  /* Figure out when to stop reading files. */
  cutoff = heuristic(max_bytes);

  /* Collect files into archives. */
  first = 0;
  counter = 0;
  while (first < limit) {
	first = collect(first, limit, cutoff);
	num[0] = '0' + (counter / 10);
	num[1] = '0' + (counter % 10);
	num[2] = 0;

	/* Construct full path of compressed target. */
	target[0] = 0;
	if (strcmp(dst_dir, ".") != 0) {
		strcpy(target, dst_dir);
		strcat(target, "/");
	}
	strcat(target, base_name);
	strcat(target, "_");
	strcat(target, num);
	strcat(target, ".Z");

	/* (cd src; shar file ... | compress -fc) >dir/base.00.Z */
	nonlocal = strcmp(src_dir, ".");
	work[0] = 0;
	if (nonlocal) {
		strcat(work, "(cd ");
		strcat(work, src_dir);
		strcat(work, "; ");
	}
	strcat(work, "shar ");
	strcat(work, names);
	strcat(work, " | compress -fc ");
	if (nonlocal) strcat(work, ")");
	strcat(work, " >");
	strcat(work, target);
	s = system(work);
	if (s < 0) {
		fprintf(stderr, "gather: shar command failed\n");
		exit(2);
	}

	/* Uue dir/base.00.Z */
	strcpy(work, "uue ");
	strcat(work, target);
	strcat(work, "\n");
	s = system(work);
	if (s < 0) {
		fprintf(stderr, "gather: shar command failed\n");
		exit(2);
	}

	/* Unlink dir/base.00.Z */
	unlink(target);

	counter++;
  }
}

int collect(first, limit, cutoff)
int first;
int limit;
long cutoff;
{
/* See how many files will fit in an archive. */

  int nr_files;
  long cum_size, size;
  struct dir_ent *p, *endp;

  names[0] = 0;
  p = &dir_ent[first];
  endp = &dir_ent[limit];
  nr_files = 0;
  cum_size = 0;

  while (p < endp) {
	size = p->file_size;
	if (size > cutoff) {
		fprintf(stderr, "gather: %s is too big\n", p->file_name);
		exit(2);
	}

	/* First peek to see if next file fits.  If not, maybe some
	 * other file can be used instead.  Swap them. */
	if (cum_size + size > cutoff) fudge(p, endp, cutoff - cum_size);

	/* If it fails now, there is no file that will fit. */
	size = p->file_size;
	if (cum_size + size > cutoff) return(p - dir_ent);
	strcat(names, p->file_name);
	strcat(names, " ");
	cum_size += size;
	p++;
  }
  return(p - dir_ent);
}

long heuristic(m)
long m;
{
  /* The basic algorithm is to collect files up to some limit, and put
   * them in an archive.  It is tricky to determine how many files to
   * collect, because they will be shar'ed, compressed and uue'ed.
   * Thus we need a heuristic for guessing how to relate the total size
   * of the input files to the size of the final uue archive.  This
   * heuristic is contained in this procedure.  It takes the desired
   * final size as input and produces the file cutoff as output. */

  return((NUMERATOR * m) / DENOMINATOR);
}



sort_dir(limit)
int limit;			/* how many entries in dir_ent */
{
/* Sort the directory using bubble sort. */

  struct dir_ent *p, *q;

  for (p = &dir_ent[0]; p < &dir_ent[limit - 1]; p++) {
	for (q = p + 1; q < &dir_ent[limit]; q++) {
		if (strcmp(p->file_name, q->file_name) > 0) swap(p, q);
	}
  }
}

swap(p, q)
struct dir_ent *p, *q;
{
  /* Exchange two entries. */

  char *cp;
  long l;

  cp = p->file_name;
  l = p->file_size;
  p->file_name = q->file_name;
  p->file_size = q->file_size;
  q->file_name = cp;
  q->file_size = l;
}


fudge(p, endp, size)
struct dir_ent *p, *endp;
long size;
{
/* Look for a file that will fit (i.e., <= size). This fudging gives a more
 * uniform distribution, and reduces the number of files needed.
 */

  register struct dir_ent *q;

  for (q = p + 1; q < endp; q++) {
	if (q->file_size <= size) {
		swap(p, q);
		return;
	}
  }
}


get_basename(s, file)
char *s;
char *file;
{
/* Determine the basename and copy it to base_name. */

  int fd, n;
  char *p, *q;

  if (*file != 0) p = file;
  else if (strcmp(s, ".") == 0) {
	if (getcwd(work, NAME_SIZE) == (char *) NULL) {
		fprintf(stderr, "gather: could not get name of working dir\n");
		exit(2);
	}
	p = work;
  } else {
	p = s;
  }

  q = p + strlen(p) - 1;
  if (*q == '\n') {
	*q = 0;
	q--;
  }
  while (1) {
	if (q < p || *q == '/') break;
	q--;
  }
  strncpy(base_name, q + 1, BASE_SIZE);
}

usage()
{
  fprintf(stderr, "Usage: gather [-b bytes] [-s src_dir] [-d dst_dir] [-f file]\n");
  exit(1);
}

norsk@sequent.UUCP (Doug Thompson) (11/15/89)

In article <4523@ast.cs.vu.nl> ast@cs.vu.nl (Andy Tanenbaum) writes:
>I have changed the way I am going to do the update to 1.4b.  My original
>plan was to fill a directory full of .cdif files, cat them together into

This posting triggered some thoughts for me. As I have not yet purchased
Minix (I am building my own 386 system, and all I need now is the Micronics
386 motherboard and a good 120 Mb disk, I have have everything else so
hardware first then software), and have been reading everyone's various
discussions, I have not yet seen a step-by-step instruction posting on
how to actually upgrade from 1.XX to 1.XX+. I understand about
uuencode/uudecode, compress and shar, but I don't know about 'cdif', where
to get it and how to use it.  I do know UNIX diff.

As postings go by with their diffs, there is no instruction on how to
apply them. So what I am asking is, is there a standard set of instructions
out there that someone has put together that could be posted for all newcomers.
(Remember, in typical service organizations, only 1 customer out of 100
will ask for info, while the other 99 hope the 1 will ask)

I wish to become a constributor to this news group. I really think Minix is
a great tool to learn with and play with. I just need some help on tools
everyone takes for granted.

thanks much

-- 
Douglas Thompson		UUCP: ..{tektronix,ogcvax,uunet}!sequent!norsk
Sequent Computer Systems	Phone: (503) 526-5727
15450 SW Koll Parkway	!"The scientist builds to learn;the engineer learns in
Beaverton OR 97006	!order to build."  Fred Brooks