bohn@imada.dk (Lars Bohn Hansen) (01/24/91)
Parallel Grouped Back-Propagation (GPBP).
-----------------------------------------

The authors of this article are two students of computer science at the
University of Odense. We are currently working on a project concerning
GPBP. If you haven't heard that particular term, don't worry, we invented
it ourselves as a name for a method suggested by our teacher. We submit
this article in the hope that someone out there may be able to tell us
whether the idea IS indeed new, and whether it holds any promise of being
an efficient way of parallelizing back-propagation.

Parallelizing back-propagation (BP).
------------------------------------

The trouble with parallelizing BP on a transputer system(1) is that you
have relatively few but very powerful units, whereas a neural network
calls for many very simple units. In other words, there is much
communication between units and not much calculation within any single
unit. The main goal when parallelizing an algorithm is to keep
communication to a minimum, since all time spent communicating is in
essence wasted. The efficiency of a parallel program is measured as its
speedup, i.e. the run time of the sequential version divided by the run
time of the parallel version. Since no communication is needed in the
sequential version, all communication is overhead. The exception is that
communication can be overlapped with calculation(2) on each transputer,
which of course requires careful structuring of the algorithm.

An immediately obvious way of parallelizing BP is to cut the neural
network into slices (one slice per transputer), each slice containing an
equal number of input, hidden and output units. This is a typical
parallelization approach, and it suffers from all of the problems
mentioned above.

Our approach.
-------------

Our approach is based on two steps:

1. Grouping.
2. Parallelizing.

Step 2 will only work if step 1 works.

Step 1: Grouping.
-----------------

There are normally two ways of presenting training data (patterns) to a
neural network. Either the weights are updated after each pattern (what
Rumelhart & McClelland call training by pattern), or the weight edits are
accumulated and the weights are updated only after all patterns have been
through the network (training by epoch). The latter is the true gradient
descent method, whereas the former follows a great many different
gradients which eventually lead to a minimum for all patterns. Strangely
enough, training by pattern seems to be the more efficient of the two,
which is usually attributed to the fact that by not following the true
gradient it avoids getting stuck in local minima.

The suggestion made by our teacher was to try an approach in between,
i.e. to update the weights after a set number of patterns. This may or
may not improve the learning ability, but the important thing is (as we
will show in the next section) that if this method does not actually
worsen the learning speed, it may serve as the foundation for a new way
of parallelizing BP. (A small sketch of what we mean by grouped updates
follows the description of step 2.)

Tests: The simple tests we have conducted so far suggest that grouped
training is, at the least, not a bad idea. We have run tests on training
sets with at most 12 patterns (not very many, we know), and they all show
that the number of iterations needed to train the network to a specified
tolerance is either constant or rises only slightly as the group size is
increased. This may not hold for very large training sets; we shall see.

Step 2: Parallelizing.
----------------------

With the input grouped as in step 1, it is now possible to train each
pattern in a group on a separate transputer, since there is no
interaction while patterns from the same group are being trained. Each
transputer then calculates a weight-edit, and all of these edits must be
summed to obtain the total weight-edit. We have arranged the transputers
in a circular structure, so that each transputer shifts its weight-edits
to its right neighbour and receives the other weight-edits from its left
neighbour. This minimizes communication compared to a master/slave
structure, where a master receives the weight-edits from each slave,
sums them, and transmits the new weights back to every transputer. This
is a rather loose description of our approach; the whole algorithm is too
complex to describe here. Provided that the communication does not
outweigh the calculation (we still haven't run tests on that), it is
possible to overlap most of the communication.
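To make the two steps a bit more concrete, here are two small sketches.
They are written in Python with NumPy purely as illustration; every name
in them is our own, the network (one hidden layer, logistic units) is
just the usual toy setup, and this is not the code that runs on the
transputers. First, step 1, the grouped update: weight edits are
accumulated over a group of G patterns and only applied at the group
boundary, so G = 1 gives training by pattern and G equal to the number
of patterns gives training by epoch.

    # Sketch of grouped back-propagation on a single hidden layer.
    # Weight edits are accumulated over a group of G patterns and the
    # weights are only updated at the group boundary.
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_grouped(patterns, targets, n_hidden=4, G=3, eta=0.5, epochs=2000):
        n_in, n_out = patterns.shape[1], targets.shape[1]
        W1 = rng.uniform(-0.5, 0.5, (n_in, n_hidden))
        W2 = rng.uniform(-0.5, 0.5, (n_hidden, n_out))
        for _ in range(epochs):
            for start in range(0, len(patterns), G):
                dW1, dW2 = np.zeros_like(W1), np.zeros_like(W2)
                # forward/backward pass for every pattern in the group,
                # accumulating the weight edits instead of applying them
                for x, t in zip(patterns[start:start+G], targets[start:start+G]):
                    h = sigmoid(x @ W1)
                    y = sigmoid(h @ W2)
                    d_out = (t - y) * y * (1.0 - y)
                    d_hid = (d_out @ W2.T) * h * (1.0 - h)
                    dW2 += np.outer(h, d_out)
                    dW1 += np.outer(x, d_hid)
                # group boundary: apply the summed edits
                W1 += eta * dW1
                W2 += eta * dW2
        return W1, W2

    # usage: the XOR problem, trained in groups of 2 patterns
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([[0], [1], [1], [0]], dtype=float)
    W1, W2 = train_grouped(X, T, G=2)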
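Second, step 2, the circular summation of the weight-edits. Each
"transputer" is simulated here as nothing more than an entry in a list;
the real transputer links, and the overlapping of communication with
calculation, are not modelled.

    # Toy simulation of the circular summation of weight-edits.
    # local_edits[p] is the weight-edit computed by "transputer" p; in
    # P - 1 shifts around the ring every transputer accumulates the
    # total edit.
    import numpy as np

    def ring_sum(local_edits):
        P = len(local_edits)
        totals  = [e.copy() for e in local_edits]   # running sums
        buffers = [e.copy() for e in local_edits]   # what travels around the ring
        for _ in range(P - 1):
            # each transputer passes its buffer to the right neighbour
            # and receives the buffer coming from its left neighbour
            buffers = [buffers[(p - 1) % P] for p in range(P)]
            for p in range(P):
                totals[p] += buffers[p]
        return totals

    # usage: 4 transputers, each with a small random weight-edit
    rng = np.random.default_rng(1)
    edits = [rng.normal(size=(2, 3)) for _ in range(4)]
    totals = ring_sum(edits)
    # every transputer ends up holding the same total, equal to sum(edits)
    assert all(np.allclose(t, sum(edits)) for t in totals)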
Adaptive BP.
------------

A final idea: adaptive BP is a method that has shown itself to be
immensely superior to classic BP. However, this method only works when
you train by epoch. If it also works with grouped training, it can be
combined with the method described above.

We would like to hear some ideas regarding this project. It is fairly
(almost tantalizingly) simple, so why hasn't anybody thought of it
before? Or maybe someone has, and gave it up because it's no good? We
would also be glad to answer any questions (if this method holds
anybody's interest but ours, that is).

Footnotes.
----------

(1) We are using a Meiko transputer system with 48 transputers, each
    capable of 1.6 Mflops.
(2) At least it can on the Meiko, since it has separate processors
    handling I/O.

Authors: Alexis Kjerulf & Lars Bohn Hansen
E-mail:  alexis@imada.dk