[comp.lang.prolog] the great split up

eiverson@nmsu.edu (Eric Iverson) (02/12/90)

Here is a split routine I wrote.  Think of it as one string acting as
a catalyst to split another string in two.  If you can think of a
better, more efficient way to do it, please let me know.  This is for
Quintus Prolog and is going to be called many many times.


% split(Cat,String,A,B) binds Cat onto String and splits it into A and B.
% String is split at the center of Cat.
split([C|C1],[S|S1],[S|A],B):-
    C == S,split2(C1,S1,A,B,1,_);
    split([C|C1],S1,A,B).

split2([],X,[],X,N,H):-
    H is N-N//2,!.

split2([C|C1],[C|S1],A,B,N,H):-
    N1 is N+1,!,
    split2(C1,S1,A1,B1,N1,H),
    (N1 > H,B=[C|B1],A=A1,!;
     A=[C|A1],B=B1),!.

--
------------------------------------------------------------------------
Another Gruntpig production, in association with the Rat Lab Steamworks.

I want to kill everyone here with a cute colorful Hydrogen Bomb!!

-Zippy the Pinhead

ok@goanna.oz.au (Richard O'keefe) (02/12/90)

In article <EIVERSON.90Feb11221732@hades.nmsu.edu>, eiverson@nmsu.edu (Eric Iverson) writes:
(Here is a "split" routine, how can it be done better in Quintus Prolog?) 

Unfortunately, calling it a "split" routine says very little.  It is not
at _all_ clear to me what this thing is supposed to do.  Before we start,
a coding tip:  *NEVER* put a semicolon at the end of a line, it looks so
much like a comma that it is confusing.  Treat ";" like "else" in Ada.

As far as I can see, split(Cat, Str, A, B) buzzes around copying elements
from Str to A until it finds an element which is identical (==) to the 1st
element of Cat.  It then hands over to split2, which demands that the rest
of Cat should follow in Str.  split2 is determinate (in fact it has rather
more cuts than it needs, but that is another story).  split/4 was not
determinate, however, so if the first element of Cat is not followed by
the rest of Cat in Str, split/4 will try the second disjunct, thus
looking for another occurrence of the first element of Cat in Str.

It follows that
	if split(Cat, Str, A, B) succeeds
	then there are lists Alpha, Omega such that
	Str = Alpha @ Cat @ Omega
	and A = Alpha @ <something>

How this constitutes "binding Cat onto String" is beyond me; note that
the use of the == test means that Cat and Str should both be bound before
split/4 is called.

After much puzzling, I finally decided that the intent is that
	split(Cat, Str, A, B)
	is to be true when Cat, Str, A, and B are all lists, and
	there exist lists Alpha, Omega such that
	(1) Str = Alpha @ Cat @ Omega
	(2) A = Alpha @ take(floor(N/2), Cat)
	(3) B = drop(floor(N/2), Cat) @ Omega
where @ is concatenation and take, drop are as in APL.

The first observation is that this is a rather strange thing to do.
What is it to be used for?  If we knew that, perhaps some other data
structure entirely might be a better choice.  We might also know then
whether all of the logical results were needed or whether the first
would suffice (for example, it might be known that Cat occurs precisely
once in Str).

The second observation is that the search is intrinsically an
O(|Cat|.|Str|) operation if we use the naive string search algorithm;
since we're using lists it would be difficult to use the Boyer-Moore
algorithm, but it might be possible to adapt the Knuth-Morris-Pratt
one.  Does anyone know of any published results on the complexity of
string search in linked lists?  This may or may not matter, depending
on how big Cat is and how big the alphabet size is.   (Can anyone tell
me where to find Galil's algorithm?  I had a copy but lost it in a move.)

The third observation is that we can precompute the back of A and the
front of B because we _know_ what they must be like: they must be like
Cat.  The trick I use is to work with a second copy of the list and
walk down it two steps at a time for every one step we take down the
original; when we've reached the end of the second copy we're half way
down the first.

%   halve(Counter, List, Front, Back) is true when
%   append(Front, Back, List) and
%   length(Front) = floor(length(Counter)/2)
%   Counter must be a proper list but the others may be uninstantiated.
%   (this can be speeded up by unrolling; see library(basics) for some
%   examples of that technique).

halve([_,_|N], [X|List], [X|Front], Back) :- !,
	halve(N, List, Front, Back).
halve([_], List, [], List).
halve([],  List, [], List).

Then we can encode split/4 directly:

split(Cat, Str, A, B) :-
    /*  precompute the last part of A (determinate)  */
	halve(Cat, Cat, FrontOfCat, BackOfCat),
    /*  precompute the front part of B (determinate)  */
	append(BackOfCat, Omega, B),
    /*  enumerate candidates for Alpha (non-determinate)  */
	append(Alpha, Cat_Omega, Str, FrontOfCat, A),
    /*  check that Cat follows Alpha in Str (determinate)  */
	append(Cat, Omega, Cat_Omega).

where append/5 is, as in library(lists):

append([], R, R, S, S).
append([H|T], L, [H|R], M, [H|S]) :-
	append(T, L, R, M, S).


I have _not_ timed this, because I have no idea what realistic data
would look like, and for all I know it might well be slower.  But I
think it's a wee bit clearer (append/5 has been in the library for
years and halve/4 should have been, it's a known technique).
It's interesting that this version obviously reduces to
	append(A, B, Str)
when Cat is empty, which seems appropriate.

We can optimise this split/4 by making the test happen a little sooner.
(The original posting had done this, at the price of forcing Cat to be
non-empty always.)
Here's how:

We start from the specification
	foo(Target, Source, Alpha, Omega) :-
		append(Alpha, Target_Omega, Source),
		append(Target, Omega, Target_Omega).
special casing several lengths of Target and unfolding gives us

foo([], S, A, C) :- !,
	append(A, C, S).
foo([X], S, A, C) :- !,
	append(A, [X|C], S).
foo([X,Y], S, A, C) :- !,
	append(A, [X,Y|C], S).
foo([X,Y,Z], S, A, C) :- !,
	append(A, [X,Y,Z|C], S).
foo([X,Y,Z|L], S, A, C) :-
	append(A, [X,Y,Z|LC], S),
	append(L, C, LC).

I leave it as an exercise for the reader to apply this optimisation to
split/4.  (For the known length cases we can optimise the first two goals
of split/4 away completely.  How much this helps depends on how common
Cats of length < 4 are.)  

Again, it all depends on what the actual data are like, and it seems more
than likely to me that the best way to do this operation is to use some
other data structure so th at it doesn't _need_ to be done.

eiverson@nmsu.edu (Eric Iverson) (02/12/90)

In article <2858@goanna.oz.au> ok@goanna.oz.au (Richard O'keefe) writes:

> After much puzzling, I finally decided that the intent is that
>	 split(Cat, Str, A, B)
>	 is to be true when Cat, Str, A, and B are all lists, and
>	 there exist lists Alpha, Omega such that
>	 (1) Str = Alpha @ Cat @ Omega
>	 (2) A = Alpha @ take(floor(N/2), Cat)
>	 (3) B = drop(floor(N/2), Cat) @ Omega
> where @ is concatenation and take, drop are as in APL.
>
> The first observation is that this is a rather strange thing to do.
> What is it to be used for?  If we knew that, perhaps some other data
> structure entirely might be a better choice.  We might also know then
> whether all of the logical results were needed or whether the first
> would suffice (for example, it might be known that Cat occurs precisely
> once in Str).

Sorry for the confusion.  Here's some data:

| ?- split([c,d],[a,b,c,d,e,a,b,c,d,e,a,b,c],A,B).

A = [a,b,c],
B = [d,e,a,b,c,d,e,a,b,c] ;

A = [a,b,c,d,e,a,b,c],
B = [d,e,a,b,c] ;

no
| ?- split([c,d,e],[a,b,c,d,e,a,b,c,d,e,a,b,c],A,B).

A = [a,b,c,d],
B = [e,a,b,c,d,e,a,b,c] ;

A = [a,b,c,d,e,a,b,c,d],
B = [e,a,b,c] ;


I just want the first string to find copies of itself in the second
string and then split the second string at the center of that copy.
Hope this clears things up.

--
------------------------------------------------------------------------
Another Gruntpig production, in association with the Rat Lab Steamworks.

I want to kill everyone here with a cute colorful Hydrogen Bomb!!

-Zippy the Pinhead

lee@munnari.oz.au (Lee Naish) (02/13/90)

In article <EIVERSON.90Feb11221732@hades.nmsu.edu> eiverson@nmsu.edu (Eric Iverson) writes:
>
>Here is a split routine I wrote.  Think of it as one string acting as
>a catalyst to split another string in two.  If you can think of a
>better, more efficient way to do it, please let me know.  This is for
>Quintus Prolog and is going to be called many many times.
>
% split(Cat,String,A,B) binds Cat onto String and splits it into A and B.
% String is split at the center of Cat.
	% eg split([a,b], [x,a,b,z], [x,a], [b,z])
split(Cat, String, A, B) :-
	built_template(Cat, TS, TA, TB),
	match_template(String, TS, TA, TB, A, B).

	% uses templates to match string and return
	% bits which are wanted
	% if TS matches with (front of) String
	%    TA is (tail of) A, and
	%    TB is B
match_template(TS, TS, TA, TB, TA, TB).
match_template(S.String, TS, TA, TB, S.A, B) :-
	match_template(String, TS, TA, TB, A, B).

	% builds templates from Cat, for use above
	% eg, if Cat = a.b.c.d.[]
	%	TS = a.b.c.d.Tail,
	%	TA = a.b.[],
	%	TB = c.d.Tail
built_template(Cat, TS, TA, TB) :-
	half_length(Cat, HLen),
	built_template_len(HLen, Cat, TS, TA, TB).

	% as above but has length to split Cat
built_template_len(0, Cat, Tail, [], Tail).
built_template_len(s(HLen), C.Cat, C.TS, C.TA, TB) :-
	built_template_len(HLen, Cat, TS, TA, TB).

	% returns half length of Cat (rounded up)
	% uses s(s(0)) notation
half_length([], 0).
half_length(C.Cat, s(HLen)) :-
	half_length_1(Cat, HLen).

	% as above, but rounds down
half_length_1([], 0).
half_length_1(C.Cat, HLen) :-
	half_length(Cat, HLen).


The code is longer than the previous version, but should be faster
and use less space.

The computation of half the length of Cat is done only once.
This computation uses a constant amount of stack space (compared with
O(Length_of_Cat) in the previous version) since it only uses tail
recursion.  No choice points are created at this stage (this is possible
because the length is represented using successor notation, rather than
Prolog integers and hence clause indexing can be used), and hence no cuts
are needed (the previous version had some unnecessary cuts also).

The matching process, and construction of A and B, is all done by a
single unification (with nice use of a logic variable in the templates),
instead of several procedure calls.

	lee

lee@munnari.oz.au (Lee Naish) (02/13/90)

In article <3176@munnari.oz.au> I wrote:
>built_template(Cat, TS, TA, TB) :-
>	half_length(Cat, HLen),
>	built_template_len(HLen, Cat, TS, TA, TB).

As Richard has pointed out, its possible to halve things without
building a separate representation of half the length.  This is how
build_template (built was a typo) can be coded:

        % this version uses halving technique pointed out
        % by ROK - Cat is uses as a representation of its
        % length, as well as being used for its elements
build_template(Cat, TS, TA, TB) :-
        build_template_len(Cat, Cat, TS, TA, TB).
 
build_template_len([], Cat, TS, [], TS) :-
       append(Cat, _Tail, TS).
build_template_len([_], C.Cat, C.TS, [C], TB) :-
       append(Cat, _Tail, TS).
build_template_len(_._.Len, C.Cat, C.TS, C.TA, TB) :-
       build_template_len(Len, Cat, TS, TA, TB).

The disadvantage of this coding is that with first argument, top level
functor indexing only, a choice point is created for each call (except
the last, for even length lists), and a choice point is left on the
stack for odd length lists.

The extra choice point can be removed by adding a cut to the second clause,
(as Richard did), but the code is still inefficient.

If your Prolog system allows fancier indexing (so the second and third
clauses can be distinguished) then no choice points are created, and the
cut is unnecessary.  In NU-Prolog, the follwing when declarations will
do the trick:

?- build_template_len([], _, _, _, _) when ever.
?- build_template_len(_.Len, _, _, _, _) when Len.

In systems without fancy indexing its best to rewrite the code so there
are two procedures, each one being indexed properly:

build_template_len([], Cat, TS, [], TS) :-
       append(Cat, _Tail, TS).
build_template_len(_.Len, C.Cat, C.TS, C.TA, TB) :-
       build_template_len_1(Len, Cat, TS, TA, TB).

build_template_len_1([], Cat, TS, [], TS) :-
       append(Cat, _Tail, TS).
build_template_len_1(_.Len, Cat, TS, TA, TB) :-
       build_template_len(Len, Cat, TS, TA, TB).

	lee