[comp.ai.neural-nets] parallel BP on a transputer system

bohn@imada.dk (Lars Bohn Hansen) (01/24/91)

Parallel Grouped Back-Propagation (GPBP).
-----------------------------------------

The authors of this article are two students of computer science at
the University of Odense. We are currently working on a project
concerning GPBP. If you haven't heard that particular term, don't
worry, we invented it ourselves, as a name for a method suggested by
our teacher. We submit this article in the hope that someone out there
may be able to tell us whether the idea IS indeed all new, and
whether it holds any promise as an efficient way of parallelizing
back-propagation.

Parallelizing back-propagation (BP).
------------------------------------

The trouble with parallelizing BP on a transputer system(1) is that
you have relatively few, but very powerful, units, whereas neural
networks call for many very simple units. In other words, there is
much communication between units and not much calculation in any one
unit. The main goal when parallelizing an algorithm is to keep
communication to a minimum, since all time spent communicating is in
essence wasted. The efficiency of a parallel program is measured as
the speedup relative to a non-parallel version of the same program,
and since no communication is needed in the non-parallel version, all
communication counts as overhead. The exception is that communication
can be done in parallel with the calculations(2) on each transputer,
which of course requires careful structuring of the algorithms.

An immediately obvious way of parallelizing BP is to cut the neural
network into slices (one slice per transputer), each slice containing
an equal number of input, hidden and output units. This is a typical
parallelization approach, and it suffers from all of the problems
mentioned above: in a fully connected network, every unit in one
slice must communicate its activation to the units in every other
slice.
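
To make the point concrete, here is a rough back-of-the-envelope
model of how unhidden communication eats into the speedup (our own
illustration, not measured figures):

    # Rough model: with P transputers the calculation time shrinks to
    # t_calc / P, but any communication that cannot be hidden behind
    # the calculation is added on top.  The numbers below are made up.

    def speedup(t_calc, n_procs, t_comm, overlap=0.0):
        """Estimated speedup over the non-parallel version.

        t_calc  -- time of the non-parallel calculation
        t_comm  -- total communication time per iteration
        overlap -- fraction of t_comm hidden behind calculation
        """
        t_parallel = t_calc / n_procs + (1.0 - overlap) * t_comm
        return t_calc / t_parallel

    # E.g. 48 transputers, communication worth 10% of the serial work,
    # half of it overlapped:
    print(speedup(t_calc=1.0, n_procs=48, t_comm=0.1, overlap=0.5))
    # -> roughly 14, far below the ideal 48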


Our approach.
-------------

Our approach is based on two steps:

    1.  Grouping.
    2.  Parallelizing.

Step 2 will only work if step 1 works.

Step 1: Grouping.
-----------------

There are normally two ways of presenting training data (patterns) to
a neural network. Either the weights are updated after each pattern
(what Rumelhart & McClelland call training by pattern), or the weight
edits are accumulated and the weights are updated only after all
patterns have been through the network (training by epoch). The
latter method is true gradient descent, whereas the former follows a
great many different gradients which eventually lead to a minimum for
all patterns. Strangely enough, training by pattern seems to be the
more efficient of the two; this is attributed to the fact that by not
following the true gradient it avoids getting stuck in local minima.

The suggestion made by our teacher was to try an approach in between,
i.e. to update the weights after a set number of patterns (a group).
This may or may not improve the learning ability, but the important
thing is, as we will show in the next section, that if this method
does not actually worsen the learning speed, it can serve as the
foundation for a new way of parallelizing BP.
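
To fix ideas, here is a minimal sketch of grouped training for a
small two-layer network, written in Python/NumPy rather than the code
we actually run on the transputers (no biases or momentum, for
brevity). group_size = 1 gives training by pattern, and group_size =
len(patterns) gives training by epoch.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def grouped_bp_pass(W1, W2, patterns, targets, group_size, eta=0.5):
        """One pass over the training set with grouped weight updates.

        W1: hidden-by-input weights, W2: output-by-hidden weights.
        """
        for start in range(0, len(patterns), group_size):
            dW1 = np.zeros_like(W1)
            dW2 = np.zeros_like(W2)
            # Accumulate the weight-edits for one group of patterns.
            for x, t in zip(patterns[start:start + group_size],
                            targets[start:start + group_size]):
                h = sigmoid(W1 @ x)                      # forward pass
                y = sigmoid(W2 @ h)
                d_out = (t - y) * y * (1.0 - y)          # output deltas
                d_hid = (W2.T @ d_out) * h * (1.0 - h)   # hidden deltas
                dW2 += np.outer(d_out, h)
                dW1 += np.outer(d_hid, x)
            # One weight update per group.
            W1 += eta * dW1
            W2 += eta * dW2
        return W1, W2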

Tests: The simple tests we have conducted so far suggest that grouped
training is, at the least, not a bad idea. We have run tests on
training sets with a maximum of 12 patterns (not very many, we know),
and they all show that the number of iterations needed to train the
network to a specified tolerance is either constant or rises only
slightly as the patterns are grouped more and more. This may not hold
for very large training sets; we'll see.


Step 2: Parallelizing.
----------------------

With the input grouped as under step 1, it is now possible to train
each pattern in a group on a separate transputer, since there is no
interaction while patterns from the same group are being trained.
Each transputer then calculates a weight-edit, and all these edits
must be summed to obtain the total weight-edit.

We have arranged the transputers in a circular structure, so that
each transputer shifts its weight-edits to the right and receives the
other weight-edits from the left. This minimizes communication, as
opposed to a master/slave structure, where a master receives the
weight-edits from each slave, sums them up and transmits the new
weights back to every transputer.

This is a rather loose description of our approach; the whole
algorithm is too complex to describe here. Provided that the
communication does not outweigh the calculation (we still haven't run
tests on that), it is possible to overlap most of the communication
with the calculation.
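
As a sketch of the circular exchange (again in Python, with plain
lists standing in for the transputer links; on the Meiko the shifting
is done over the links and overlapped with calculation where
possible):

    import numpy as np

    def ring_sum(local_edits):
        """local_edits[p] is processor p's weight-edit (a NumPy array).

        Each "transputer" passes what it last received to the right
        while adding what arrives from the left to its running sum.
        After P - 1 shifts every processor holds the total edit.
        """
        P = len(local_edits)
        totals = [e.copy() for e in local_edits]     # running sums
        in_flight = [e.copy() for e in local_edits]  # what p sends next
        for _ in range(P - 1):
            # every processor sends its in-flight edit to the right ...
            received = [in_flight[(p - 1) % P] for p in range(P)]
            # ... and adds what it got from the left to its running sum
            totals = [totals[p] + received[p] for p in range(P)]
            in_flight = received
        return totals   # all P entries now equal the grand total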


Adaptive BP.
------------

A final idea: Adaptive BP is a method that has shown itself to be
immensely superior to classic BP, but it only works when you train by
epoch. If it also works with grouped training, it can be combined
with the methods described above.
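
To give a flavour of how such a rule would combine with grouping,
here is a sketch of one possible per-weight adaptive learning-rate
rule (for illustration only; the factors 1.1 and 0.5 are arbitrary):
the rate for a weight grows while successive grouped weight-edits
keep their sign, shrinks when the sign flips, and the rule is applied
once per group instead of once per epoch.

    import numpy as np

    def adaptive_group_update(W, dW, prev_dW, eta, up=1.1, down=0.5):
        """Apply one grouped weight-edit with per-weight learning rates.

        W, dW, prev_dW and eta are arrays of the same shape; dW is the
        summed weight-edit for the current group, prev_dW the one for
        the previous group.
        """
        same_sign = dW * prev_dW > 0.0
        # Grow the rate where the edit keeps its direction, shrink it
        # where the direction flips.
        eta = np.where(same_sign, eta * up, eta * down)
        W = W + eta * dW
        return W, eta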


We would like to hear some ideas regarding this project. It is fairly
(almost tantalizingly) simple, so why hasn't anybody thought of it
before? Or maybe someone has, and gave it up because it's no good?

Also, we would be glad to answer any questions (if this method holds
anybody's interest but ours, that is).



Footnotes.
----------

(1) We are using a Meiko transputer system with 48 transputers, each
capable of 1.6 Mflops.

(2) At least on the Meiko it is, since it has separate processors
handling i/o.



Authors: Alexis Kjerulf & Lars Bohn Hansen

E-mail: alexis@imada.dk