[alt.sources] Sorting Algorithm

wayned@wddami.spoami.com (Wayne Diener) (08/17/90)

I have been unable to find any references to the sorting algorithm
used in this program.  If anyone has seen it elsewhere, I would 
appreciate hearing about it.  None of my "guru" programmer friends,
my professors (I'm a hardware type who went back to college to learn
a little software) or a literature search have turned up an equivalent
so I'm posting it here to see if any one already knows of it.  If it's
original with me, I claim a copywrite on it and release it for use
by anyone.  I'm certain it can be improved - feel free to do so, but
send me a hard copy via US mail, OK?

I have done empirical comparisons of this sort against bubble,
insertion, selection and queuemerge sort.  It's a lot faster than
the first three but slower than the queuemerge sort, however, it
does the sort "in-place" (requires very little additional memory,
other than a few more variables).  I'm not really terribly proficient
at "Big O" analysis, but it LOOKS like it might be O(N*log(N))
instead of O(N^2).  Anyone want to analyse it?  

I 'trimmed' this file from the original form (massive documentation
for class requirements) to include only what I think is really 
useful.  I don't think I cut out anything important, but I can't
be 100% sure since I haven't re-compiled.  The sorting algorithm itself 
should be okay.  I compiled and ran the program using Turbo C, 
cc on a Sun 386i and under Ultrix, so it should work in most environments. 

The sorting is accomplished using a binary search of a linked list
to find the the proper insertion point (just running up and down
the pointers and dividing the list in half each time) and then 
moving the pointers to change an item's location.

Have fun, and not too many flames, OK? (Remember this was a class
assignment (for string manipulation actually) and I had to demonstrate
some concepts other than just the sorting.)


X-----------------------  CUT  ----------------------------------------X
/***********************************************************************
   A "Binary Insertion Sorting Technique for Linked-Lists"
   Wayne D. Diener
   South 3415 Myrtle
   Spokane, WA  99223
   (509) 535-4670

	This program reads a file of input words (as you might type
	them in with any programming editor), prints the word count
	and the list of words, sorts the list and then prints the
	list again.

    (Oh, yea...  Bi_In_Sort() is the actual function)


   *inp         FILE          Used to read the characters from the input
                              file.
   ch           char          Used for input to "hold" the screen long
                              enough to see the results.
   *mnlst       main_list    A pointer to the header for the list.

************************************************************************/

#include <stdio.h>
#include <fcntl.h>
typedef char data;

typedef struct node              /* Each node contains a character. */
{
   data character;
   struct node *next_letter;
   struct node *prev_letter;
} node;

typedef struct header            /* Each header starts a word. */
{
   int letter_count;
   struct node *word_head;
   struct node *word_tail;
   struct header *next_word;
   struct header *prev_word;
} header;

typedef struct main_list         /* This is the main list header. */
{
   int word_count;
   struct header *head_word;
   struct header *tail_word;
} main_list;


main(argc,argv)

   int argc;
   char *argv[];

{
   FILE       *inp,*fopen();
   int        ch;
   main_list  *mnlst;
   void       read_list();
   void       print_list();
   void       erase_list();
   void       Bi_In_Sort();
   int        compare_words();

   if (argc != 2)
   {
      printf("Error.  Useage:  %s filename",argv[0]);
      exit(1);
   }

   if ((inp = fopen(argv[1],"r")) == NULL)
   {
      printf("Could not open %s for input.",argv[1]);
      exit(1);
   }
   mnlst = (main_list *) malloc (sizeof(main_list));

   read_list(mnlst,inp);  /* Read the words into a list. */

   fclose(inp);     /* Close the input file. */

   printf(" Word count = %d\n",mnlst->word_count);

   printf("\n The input word list printed forward is:\n\n");
   print_list(mnlst->tail_word,0);  /*It's recursive so start at wrong end*/

   Bi_In_Sort(mnlst,compare_words);              /* Sort the list. */

   printf("\n The sorted word list is:\n\n");
   print_list(mnlst->tail_word,0);
   printf("\n\n\n Press return to continue.\n");
   scanf("%c",&ch);      /* Leave the screen up until a <cr> */
   erase_list(mnlst);    /* Clean up the memory. */

}

void
print_word(ptr)
   header *ptr;
/***********************************************************************
   print_word() - accepts a pointer to the header of a word it prints
   out all the characters contained in the list at that node and then
   prints a space character.

   variables used:
   name         type                     Description
   -------------------------------------------------------------------
   *p           node           Points at the characters to print.
   i            int            A loop control variable that counts letters.
***********************************************************************/
{
   node *p = ptr->word_head;
   int i = ptr->letter_count;
   while (i-- != 0)
   {

      printf("%c",p->character);
      p = p->prev_letter;
   }
   printf("%c",' ');     /* Put a space after it. */
}

void
print_list(point,dir)
   header *point;
   int   dir;
/***********************************************************************
   print_list() - accepts a pointer to one of the word nodes on the list
   and a variable that determines which direction to print (forward or
   reverse).  It then traverses the list of words recursively and prints
   the word contained at each node.

   variables used:
   name         type                     Description
   -------------------------------------------------------------------
   none
***********************************************************************/
                         /* If dir = 0 we'll print the list normally. */
                         /* If dir != 0 we'll print backward. */
{                        /* It works either direction. */
   void Print_word();

   if((dir ? point->prev_word : point->next_word) != NULL)
      print_list((dir ? point->prev_word : point->next_word),dir);
   print_word(point);
}

void
erase_list(list)
   main_list *list;
/***********************************************************************
   erase_list() -  accepts a pointer to the main word list.  It traverses
   the list erase each word list associated with the node, goes to the
   next node and erases the previous header.  Finally it erase the last
   word node and then the header for the main list.

   variables used:
   name         type                     Description
   -------------------------------------------------------------------
   *p            header     Points at the word node to be erased.
***********************************************************************/
{
   header  *p = list->head_word;
   void erase_word();
   while ( p != NULL)  /* p is not passed list->tail_word*/
   {
      erase_word(p);        /* Erase the word. */
      p = p->prev_word;     /* Go to the next word. */
      if (p != NULL)
         free(p->next_word);   /* Free the previous word header. */
   }
   free(list->tail_word);  /* Free the last word header. */
   free(list);             /* Free the list header. */
}

void
erase_word(word_node)
   header *word_node;       /* word_node is the header for the word. */
/***********************************************************************
   erase_word() -  Accepts a pointer to a word node.  It traverses the 
   list of character nodes and frees the memory associated with each
   character.

   variables used:
   name         type                     Description
   -------------------------------------------------------------------
   *p           header        A helper pointer.
   *q           header        The pointer used to free the memory.
   i            int           A loop counter used to count the letters.
***********************************************************************/
{
   node *p,*q = word_node->word_head;       /* p points at the letters. */
   int i;
   for (i=0;i < word_node->letter_count;i++)
   {
       p = q->prev_letter;   /* Save the next letter pointer. */
       free(q);              /* Free the letter. */
       q = p;                /* Point at the next letter. */
   }
}

int
compare_words(first,second)
   header   *first,*second;
/***********************************************************************
   compare_words() -  Accepts two pointer to word headers.  It compares
   the letters contained at each node of the word in succession.  If it
   encounters a letter in one word that is greater than the corrsponding
   letter in the other word, it returns the appropriate value.  If the end
   of either (or both) word(s) is reached a determination of the longer
   of the two words is attempted by comparing the lengths of the words.
   If the lengths are different, the function returns the appropriate
   value, if the word lengths are the same, it returns the "equal value".

   variables used:
   name         type                     Description
   -------------------------------------------------------------------
   *p           header        Points to the header of the first word
                              to compare.
   *q           header        Points to the header of the second word
                              to compare.

***********************************************************************/
/*  if first > second, return 0.  If second > first, return 2
if first = second, return 1 */

{
   node *p = first->word_head,*q = second->word_head;

   while ((p != NULL) && (q != NULL))  /* As long as letters are there. */
   {
      if ( p->character > q->character)        /* First > Second. */
         return(0);
      else if ( p->character < q->character)   /* Second > First. */
         return(2);
      else                                     /* Equal so far! */
      {
         p = p->prev_letter;     /* Go to the next letters. */
         q = q->prev_letter;
      }
   }
 /* To get here, one or both of the words is out of letters and they
 are equal to this point. */

   if (first->letter_count > second->letter_count)       /* First > */
      return(0);
   else if (first->letter_count < second->letter_count)  /* Second > */
      return(2);
   else return(1);          /* The words are equal. */
}

void
Bi_In_Sort(big_list,compare)
   main_list *big_list;
   int    (*compare)();

/***********************************************************************
   Bi_In_Sort() - Accepts a pointer to the header of a list to process.
   First, a sorted portion is created at the end of the list that is 2
   items long then a loop is entered that repeatedly takes the next item
   at the "head" of the list and uses a binary search of the items in the
   sorted portion to determine the correct location for the new item.
   The new item is then removed from the head of the list and inserted
   at the appropriate location.  This process is repeated until the last
   item has been processed.

   variables used:
   name         type                     Description
   -------------------------------------------------------------------
   test         int           Used as a flag to signal the instance
                              where the new item should be inserted
                              prior to the present smallest member of
                              the list.
   count        int           Used to keep track of the number of words
                              already in the sorted portion of the list.
   middle       int           Used as a counter control variable to
                              determine how far "up" or "down" the list
                              to travel during the binary search.
   i            int           The count control variable for list traversal.
   up           int           Used as a boolean control variable to determine
                              if the next movement on the list should be
                              "up" the list or "down" the list.
   *current     header        The pointer that is moved "up" and "down" the
                              list while searching for the proper insertion
                              location.
   *newitem     header        A pointer that is used as a "handle" during
                              the movement/insertion of the head of the list.
   *sortbound   header        A pointer that points to the "lowest" item of
                              the sorted portion of the list.

***********************************************************************/
{
   int      test,count=1,middle,i,up;
   header   *current,*newitem,*sortbound;
   void     insert();

   if (big_list->word_count > 1)  /* A one item list is already sorted. */
   {
      current = big_list->tail_word->next_word;
      if ( (*compare)(current,big_list->tail_word) == 0)
      {
         current->next_word->prev_word = big_list->tail_word;
         big_list->tail_word->next_word = current->next_word;
         current->next_word = big_list->tail_word;
         big_list->tail_word->prev_word = current;
         current->prev_word = NULL;
         big_list->tail_word = current;
      }
      /*   The sorted part is now two items long. */
      sortbound = big_list->tail_word->next_word;
      do
      {
         up = 1;                         /*"Outside loop" initializations.*/
         newitem = big_list->head_word;
         count++;
         middle = (count +1) / 2;
         current = sortbound;
         test = 0;
         do
         {
            for ( i=0; i < middle ; i++)
            /* Go to the appropriate "middle". */
            {                          /* Either up or down the list. */
               current = up ?   current->prev_word : current->next_word;
               if (current == sortbound)
               {
                  test = 1;          /* If we get to the sort boundary, */
                  break;             /* we have to quit. */
               }
            }
            if (((*compare)(newitem,current) == 0) && current != NULL)
            {
               if (current == big_list->tail_word)
               /* Place the item at the tail. */
               {
                  big_list->head_word = newitem->prev_word;
                  big_list->head_word->next_word = NULL; /* the new head */
                  newitem->prev_word = NULL; /* the new end of the list */
                  newitem->next_word = big_list->tail_word;
                  big_list->tail_word->prev_word = newitem;
                  big_list->tail_word = newitem;
                  break;
                  /* The sortbound stays in the same place. */
               }
               else
                  up = 1;     /* Otherwise we have to look further "up". */
            }
            /*  The case of inserting after the tail_word is finished. */
            else if(test)   /* To get here, newitem <= current */
                            /* and current = sortbound */
                            /* so we insert before current */
                            /* and move sortbound to the new item. */
            {
               if ( current == newitem->prev_word)
                  return;   /* This is the actual finishing point. */
               else
               {
                  insert(big_list,newitem,current);
                  sortbound = newitem;
                  break;
               }
            }
            /*  Inserting before sortbound is finished. */
            /*  To get here, newitem <= current and somewhere in the middle*/
            else if ( (*compare)(newitem,current->next_word) <= 1)
            {
               insert(big_list,newitem,current);
               break;
            }
            else
            /*  newitem is strictly less than current->next, so: */
            /*  go down the list.  */
               up = 0;
       middle /=2;
       if (middle == 0)
          middle++;
       }
       while (1);
    }
    while (sortbound != big_list->head_word);
    }
}

void
read_list(ml,inp)
   main_list *ml;
   FILE     *inp;
{
   node       *dat;
   header     *list;
   char       a, ch = 32;    /* This makes the initial while loop work. */
   void       print_word();

   while ((ch == 32) || (ch == 13) || (ch == 10))
      ch = getc(inp);  /* This halts formation of words from */
   ungetc(ch,inp);   /* leading spaces, etc. (no letters). */

   if((ch = getc(inp)) != EOF)
   {
      dat = (node *) malloc (sizeof(node));
      list = (header *) malloc (sizeof(header));
      ml->head_word = list;      /* Points to the head of the whole list. */
      ml->tail_word = list;      /* Points to the tail of the whole list. */
      ml->word_count = 1;        /* There's at least one word in the list. */

      list->letter_count = 1;    /* There's at least one letter in the word. */
      list->word_head = dat;     /* This points at the first letter. */
      list->word_tail = dat;     /* This points at the last letter. */
      list->next_word = NULL;    /* This is the only word so far. */
      list->prev_word = NULL;


      dat->character = ch;  /* This is the first letter of the first word. */
      dat->next_letter = NULL;     /* It's also the only letter right now. */
      dat->prev_letter = NULL;

      while(ch != EOF)
      {
         if((ch=getc(inp)) != EOF)
         {
            if((ch == 32) || (ch == 13) || (ch == 10))
            {                   /* New word.  Make a new word header node. */
                                /* The following while, ungetc and if(ch..) were also necessary
                                        if allowing <cr> and <lf> as word seperators. */
               while ((ch == 32) || (ch == 13) || (ch == 10))
                                                ch = getc(inp);  /* This halts formation of words from */
               ungetc(ch,inp);   /* extra spaces, etc. (no letters). */
                                 /* but put the last character back. */
               if (ch == EOF)    /* This protects agains a final word */
                                                break;         /* being generated at EOF. */
               list = (header *) malloc (sizeof(header));
               /* link it into the main_list. */
               list->prev_word = NULL;               /* It's the new end. */
               ml->tail_word->prev_word = list;
               list->next_word = ml->tail_word;
               ml->tail_word = list;    /* Adjust the end of the word list. */
               ml->word_count++;        /* Increment the word count. */
               list->letter_count = 0;  /* No letters in the word yet. */
            }
            else if((ch != 10) && (ch != 13))
            /* Disallow <cr> and <lf> as actual parts of the words. */
            {
               dat = (node *) malloc (sizeof(node));
               dat->prev_letter = NULL; /* This is the new end of the word*/
               dat->character = ch;     /* Place the letter in the node. */
               if (list->letter_count == 0)  /* No letters in the word yet*/
               {
                  list->word_head = dat;   /* Point at the end of the word*/
                  dat->next_letter = NULL; /* The first letter has no next*/
               }
               else
               {
                  list->word_tail->prev_letter = dat; /* Link to end word */
                  dat->next_letter = list->word_tail;
               }
               list->letter_count++;  /* Increment the letter count. */
               list->word_tail = dat; /* The new end of the word. */
               }
         }
      }
   }
}

void
insert(bg_lst,new,cur)      /* These are the common statements required*/
   main_list *bg_lst;       /* for any insertion prior to current pointer*/
   header    *new,*cur;
/***********************************************************************
   insert() - Accepts the pointers to the main list and the current
    and newitem and performs an isertion of the new item prior to the
    current location.

   variables used:
   name         type                     Description
   -------------------------------------------------------------------
   none
***********************************************************************/
{

   bg_lst->head_word = new->prev_word;
   bg_lst->head_word->next_word = NULL;
   new->next_word = cur->next_word;
   cur->next_word->prev_word = new;
   cur->next_word = new;
   new->prev_word = cur;
}

wayned@wddami.spoami.com (Wayne Diener) (08/17/90)

Apparently there is some difficulty in getting mail to me.
(I AM somewhat of a novice at UUCP)

Mr. Victor Kan of Data General Corporation managed to get mail to me with
the following address. (And I thank him for his comments and suggestions)

   isc-br!hawk!wddami!wayned@uunet.uu.net

If anyone has tried to respond to my posting of a (hopefully) new
sorting algorithm and it bounced, you might try the above.

Thanks.

Wayne Diener

wayned@wddami.spoami.com (Wayne Diener) (08/21/90)

>In article <15890@oolong.la.locus.com> geoff@ITcorp.com (Geoff Kuenning) writes:
>In article <wayned.0092@wddami.spoami.com> wayned@wddami.spoami.com
>(Wayne Diener) writes:
>
>> I have been unable to find any references to the sorting algorithm
>> used in this program.  If anyone has seen it elsewhere, I would 
>> appreciate hearing about it.  None of my "guru" programmer friends,
>> my professors (I'm a hardware type who went back to college to learn
>> a little software) or a literature search have turned up an equivalent
>
>Did your professors point out that you are basically just building a
>binary tree in an expensive fashion?
>
>> does the sort "in-place" (requires very little additional memory,
>> other than a few more variables).  I'm not really terribly proficient
>> at "Big O" analysis, but it LOOKS like it might be O(N*log(N))
>> instead of O(N^2).  Anyone want to analyse it?  
>
>Most people would consider list links to be additional memory.  As to
>complexity, it depends tremendously on the cost of comparisons.  If
>comparisons are costly, then the complexity is indeed O(N log N).  On
>the other hand, if comparisons are cheap compared to the cost of
>following pointers (e.g., you are sorting 32-bit integers), then the
>complexity becomes at least O(N**2) because you will spend all of your
>time chasing up and down the list to do your "binary search".  (This
>shows up one of the many deficiencies in "Big O" analysis).
>
>> The sorting is accomplished using a binary search of a linked list
>> to find the the proper insertion point (just running up and down
>> the pointers and dividing the list in half each time) and then 
>> moving the pointers to change an item's location.
>
>This is very similar to building a binary tree, or possibly a balanced
>binary tree.  Binary tree insertion is O(N log N);  I think balanced trees
>have similar complexity but I don't have a reference handy to verify this
>at the moment.  The way your code does the binary search is potentially
>very expensive compared to the cost of keeping the tree balanced, especially
>with very large data bases.
>
>Having said all the above, let me add that I do think your algorithm is
>interesting, and it was thoughtful of you to post it.  In some situations,
>your algorithm will be the superior choice, and it may also inspire some
>interesting derivatives.
>
>	Geoff Kuenning	geoff@la.locus.com	geoff@ITcorp.com

     Thanks for your response...  Yes, I know that the efficiency of the
     sort depends a great deal on the cost of the comparisons vs the
     cost of the pointer manipulations.  I managed to get that far in
     the analysis myself, but I wrote the thing as a homework assignment
     for a first course in data structures (hadn't gotten to Btrees
     general trees, etc. yet), but it _did_ seem like an interesting
     approach, so when I ran across it the other day, I thought I'd
     just throw it out to the community at large and see what happened.
     I'm sure it can be improved substantially.

     This is the body of the first response I've received suggesting
     a possible method of improving the efficiency.

     *********************************

|     In-Reply-To: wayned@wddami.spoami.com's message of 16 Aug 90 23:00:00 GMT
|From: Victor Kan <kan@DG-RTP.DG.COM>
|To: isc-br!hawk!wddami!wayned@uunet.uu.net
|Subject: Sorting Algorithm
|
|
|[ I tried sending to wayned@wddami.spoami.com but my nameserver can't
|  find it, so I tried this other path.
|]
|
|I can't say that I've ever seen it done this way.  If even your
|professors can't recognize it after serious analysis, maybe you should
|send mail to Knuth at Stanford.
|
|I have a problem with the code though (probably why I've never seen it
|done in text books).
|
|Your "binary search of a linked list" isn't log(N) as the usual
|bsearch of an array would be since you must do that non-trivial
|"boogie through the list" (as my algorithms professor called it :-) to
|get to the middle.  Although you don't do comparisons in that loop, it
|might still add some complexity (I never took a course in formal
|algorithmic analysis, only a fundamental algorithms course).
|
|On the average case, you must do more than M/2 pointer traversals
|(where M is the size of the sorted-so-far list, and I think it's
|really sum{i=1..log(M) of M/2^i} traversals) in get-to-the-middle
|operations, right?
|
|That's at least sum(1..N)/2 = ((N^2+N)/2)/2 = O(N^2+N)
|
|to sort the entire set.  And that must be added to the N log(N) actual
|comparisons done.  It may have run faster than the other O(N^2) sorts
|because they actually do N^2 comparisons, whereas you do N log(N)
|comparisons and N^2+N cheap traversals.  But in terms of complexity, I
|think (can't be sure) those N^2+N traversals must be included.
|
|So in my very humble analysis, your sort is O(N Log(N) + N^2 + N)
|where the latter two terms should be multiplied	by some constant
|factor in your time analysis to account for cheaper (i.e.
|non-comparison) operations.
|
|Below, I offer a possible speed up and complexity elimination step
|which might make the sort O(N Log(N) + some other log expr).
|
|However, since I thought of it on company time and mailed it with
|company resources, the suggestion may be considered copyrighted by my
|company, Data General Corporation.
|
|If you use it and become famous for inventing a new O(N Log(N)) sort,
|please credit me and my company :-).
|
|There may be some ways to eliminate the O(N^2+N) cheap traversals to
|the middle by maintaining an array of "middle" pointers to point to
|the various "middles" in your linked list.  However, maintaining this
|array is complicated.  Optimally, you must "know" the unsorted list
|ahead of time so you can assign the middle pointers correctly in the
|array so you don't have to shuffle them around.  But if you don't know
|the properties of the list, you could actually do this shuffling
|pretty cheaply since there are only log(N) pointers to shuffle, and
|worst case shuffling would be similar to quicksort complexity on that
|short log(N) sized list (i.e. P log(P) average, P^2 worst where P =
|log(N)).
|
|
| Victor Kan               | I speak only for myself.               |  ***
| Data General Corporation | Edo emacibus, ergo sum.                | ****
| 62 T.W. Alexander Drive  | Columbia Lions Win, 9 October 1988 for | **** %%%%
| RTP, NC  27709           | a record of 1-44.  Way to go, Lions!   |  *** %%%

     *******************************************


--
Wayne Diener

geoff@ITcorp.com (Geoff Kuenning) (08/22/90)

In article <wayned.0092@wddami.spoami.com> wayned@wddami.spoami.com
(Wayne Diener) writes:

> I have been unable to find any references to the sorting algorithm
> used in this program.  If anyone has seen it elsewhere, I would 
> appreciate hearing about it.  None of my "guru" programmer friends,
> my professors (I'm a hardware type who went back to college to learn
> a little software) or a literature search have turned up an equivalent

Did your professors point out that you are basically just building a
binary tree in an expensive fashion?

> does the sort "in-place" (requires very little additional memory,
> other than a few more variables).  I'm not really terribly proficient
> at "Big O" analysis, but it LOOKS like it might be O(N*log(N))
> instead of O(N^2).  Anyone want to analyse it?  

Most people would consider list links to be additional memory.  As to
complexity, it depends tremendously on the cost of comparisons.  If
comparisons are costly, then the complexity is indeed O(N log N).  On
the other hand, if comparisons are cheap compared to the cost of
following pointers (e.g., you are sorting 32-bit integers), then the
complexity becomes at least O(N**2) because you will spend all of your
time chasing up and down the list to do your "binary search".  (This
shows up one of the many deficiencies in "Big O" analysis).

> The sorting is accomplished using a binary search of a linked list
> to find the the proper insertion point (just running up and down
> the pointers and dividing the list in half each time) and then 
> moving the pointers to change an item's location.

This is very similar to building a binary tree, or possibly a balanced
binary tree.  Binary tree insertion is O(N log N);  I think balanced trees
have similar complexity but I don't have a reference handy to verify this
at the moment.  The way your code does the binary search is potentially
very expensive compared to the cost of keeping the tree balanced, especially
with very large data bases.

Having said all the above, let me add that I do think your algorithm is
interesting, and it was thoughtful of you to post it.  In some situations,
your algorithm will be the superior choice, and it may also inspire some
interesting derivatives.

	Geoff Kuenning	geoff@la.locus.com	geoff@ITcorp.com

boniface@kayak.cis.ohio-state.edu (Andrew Boniface) (08/22/90)

Big O notation refers to behavior in the limit.  In algorithm analysis
this means in the limit as input length, e.g. number of items to be
sorted, approaches infinity.  Sorting n items by `binary insertion'
into a linked list requires roughly n^2 pointer followings (if done
efficiently, otherwise more n^2*log n) and log n comparisons.  If
n*{time to follow a pointer} > log n * {time to do a comparison}
then clearly the time to follow pointers dominates.

Unless the times above vary with n, then, as n goes to infinity, the
pointer followings eventually take up most of the time, yielding
O(n^2) execution time. Big O analysis is CRUDE (it will not tell you
how big n must be for the n^2 term to take over), but it is not
ambiguous.

--Andrew W Boniface

daveg@near.cs.caltech.edu (Dave Gillespie) (08/25/90)

The binary-searching insertion sort is one of those things that seem like
they ought to work, but somehow the data structures invoke Murphy's Law no
matter which way you turn.  Knuth discusses it in the case of sorting arrays
(_Art_of_Computer_Programming_, Volume III, section 5.2.1) and notes that
even though the search time is reduced to O(N log N), the time to shift
around the array elements still dominates at O(N^2).  Your suggestion of
using linked lists eliminates the time to shift the array, but now the search
is at least O(N^2) due to the list traversals.  Both the array shifting and
the pointer traversal overheads are likely to be either negliglble or
(more likely) not negligible in the same situations.

My guess is that Victor Kan's fix will indeed turn out to require shuffling
O(log N) pointers at each step, but at enough cost per pointer that the
overall method is again O(N^2).  The problem is that, if the input is
randomly distributed, each binary-search path may be wildly different
from the previous one, so the list must still be traversed a significant
distance most of the time.  I hope you can prove me wrong, Victor!

I remember seeing a clever technique in the June 1990 issue of
"Communications of the ACM" for doing binary-search-like things with
linked lists.  They augmented the linked list structure with
long-distance links.  So every element points to the next element,
some also point to the second element after them, fewer still also
point to the fourth following element, and so on.  They showed that if
you build this structure probabilistically, it's still relatively
efficient in both time and space (which can easily be traded against
each other) and also easy to manipulate.  Pretty cool stuff.

Anyhow, I've addressed followups to alt.sources.d because that seemed
a more appropriate place for the discussion.

								-- Dave
--
Dave Gillespie
  256-80 Caltech Pasadena CA USA 91125
  daveg@csvax.cs.caltech.edu, ...!cit-vax!daveg

ajz@mentor.cc.purdue.edu.UUCP (08/27/90)

Having recently taken a course where a 3 week project was based upon sorting,
here are my notes...

Here's the background, the sort was performed on four 9-digit integers where
a seperate file told which of those 4 integers was the most significant key and
which was the least.  The total time spent in the program had to be less than
4 seconds for a 4000 item list (ours was about 1 second and although .15
seconds could have shaved off of it [ie, by killing all the error checking
routines], I saw one that did it in .5 seconds).

The first thing you should know is that any textbook routine like quicksort,
mergesort, or even a radix sort failed to sort a random 4000 item list in under
4 seconds on our Gould NP-1.

This is what my team learned...
1) Traversing a linked list to locate an item is very time consuming.
2) Dumb O(n^2) routines have very little overhead making them MUCH FASTER than
   O(n log n) routines (which have high overheads) for small values of n (<15).
3) An insertion sort [O(n^2)] is one of the fastest sorts for nearly sorted
   data especially when the unsorted items are close to each other.
4) Recursive routines took more time than non-recursive ones.

The best routine I saw chain hashed the values into a 33333 item array, then
it took the array and rebuilt a linked list out of it, and finally it ran an
insertion sort on the list.  For random data, there should be very few cases
where more than 5% of the items in the rebuilt linked list will be out of 
place.  Of these out of place items, they should be uniformly distributed
within the linked list so that they don't have to be moved more than 3 spaces
to place them back into their proper position.

Since a 4000 item list will only fill the hash table to ~10%, there should be
on average only 1.1 probes needed for each item.  It should therefore take
O(1.1n + n + c) where the second "n" comes from the insertion sort having to
traverse the entire list at least once and the "c" is a constant depending upon
how many items were out of place and how far they had to be moved on the
average (shouldn't be more than 4000 * 5% * 3 = 600 or n * 5% * 3 = .15n).
It should therefore average O(2.25n), have a best case of O(2n), and have
a worst case of O(2n^2).

The routine my team wrote insertion sorted every group of 12 integers, placed
this group into an larger array, and then merge sorted this array.  This gave
a average time of O((n / 12) log (n / 12) + (12^2 / 4) * (n / 12)), a best
case of O((n / 12) log (n / 12) + n), and a worst case of O((n / 12) log
(n / 12) + (12^2 / 2) * (n / 12)).

For 4000 items...
                First routine    Second routine
average time    O(9000)          O(14793.6)
best time       O(8000)          O(6793.6)
worst time      O(3.2 * 10^7)    O(26793.6)

I should stress that the worst case for the second routine is pretty straight
forward to find, but the worst case determination of the first routine would
be quite a task.  In practice, chances are the second routine will operate
under unfavorable conditions many more times than the first routine.

-- 

T. Tim Hsu                           UUCP    ...pur-ee!ajz@mentor.cc.purdue.edu
                                     ARPA    ajz@mentor.cc.purdue.edu
FAX  1 317 494 0566                  BITNET  xajz@PURCCVM