SIMD Parallel computing
63
|
The Art of Multiprocessor Programming
Price: $41.52
List Price: $69.95 |
|
Parallel Processing and Parallel Algorithms: Theory and Computation
Price: $81.08
List Price: $139.00 |
|
Fundamentals of Parallel Processing
Price: $60.59
List Price: $110.00 |
|
Connectionism and the Mind: Parallel Processing, Dynamics, and Evolution in Networks
Price: $6.00
List Price: $51.95 |
At one very fulfilling time in my career, 8 years in total, I developed code on a Distributed Array Processor. (DAP). This was a grid of 64 by 64 single bit processors with nearest neighbour data connections. This meant each processor could access it's 4 nearest neighbours. For example the content of each of the 4096 processors could be added to the content of, say its northern neighbour in one parallell operation.
Here I give a few example of how this architecture can give more speedup than expected. It is not a comprehensive review of techniques, and I plan to include a few more advanced methods in a second article. While SIMD parallelism is not fashionable these days I think this architecture offers potential and the ideas involved should not be lost entirely. In any case tI hope to offer a glimpse into an interesting byway of technology.
History
The original DAP was developed by ICL and the academic part of the research was done at Queen Mary College London. While I was there ICL lost interest in the DAP and a group of academics and ex ICL employees founded Active Memory Technology (AMT) to develop the commercial potential of the DAP. The main competitor of the DAP was the Thinking Machines Connection Machine and AMT had legal battles with them over alleged patent infringements by CM.
In the end both companies failed and AMT was, as I recall, taken over by an American company called Masspar. When I heard this I confidently predicted it would not be long before the DAP was introduced into the UK as a marvellous AMERICAN invention and I was not disappointed.
The Connection Machine, with 265 by 256 processors had a catwalk with a handrail that allowed you to walk around the top. I saw a photograph of the CM that showed the entire machine and the catwalk looked so small it was almost invisible. By comparison the AMT 64 by 64 DAP was the size of a four drawer filing cabinet and later versions were about the size of a can of soft drink.
The CM used a parallel version of C and the DAP used a parallel version of Fortran, which, I believe, influenced the standard for Fortran 90. The DAP, and I believe the CM, were both Batch machines with a compiled program sent from a host to the DAP to be run there. Later versions of the operating system allowed a limited degree of interactivity. I believe both machine were inspired by the ILLIAC IV
Spatial Averaging
Here I use a pseudocode based on DAP Fortran. For example to replace each processor with the average of its 4 nearest neighbours the code would be
(1) X = 0.25*(SHEP(X) + SHWP(X) + SHSP(X) + SHNP(X))
Where for example SHEP(X) meant “Place the content of each processor in the processor to the east and pad the edges with zero. Cyclic boundary conditions were also possible and the corresponding command was SHEC(X).
Let's look at (1). In (1) an averaging is performed in 4 parallel additions. On a 64 by 64 DAP and 64 bit data assume the addition takes 64 times longer on a one bit processor than on a conventional processor. The result will still be 16 time faster than with a Von Neumann Architecture. In fact adding two numbers with a single bit processor was quicker than I assumed so the speedup will be greater.
More efficiency can be obtained when averaging content of all all eight neighbours with the following line of code
(2)X = X + SHEP(X) + SHWP(X)
X = X + SHSP(X) + SHNP(X)
X= 0.125*X
A total of only four parallel additions and one division at each Processing element (PE)\
To see what happened look at 3 by 3 patch of Processors with the following content
1 2 3
4 5 6
7 8 9
For the central column applying the first line gives
1 +2 + 3
4 + 5 + 6
7 + 8 + 9
and applying the second line to the intermediate result from the first line gives the central sum
1 +2 + 3 + 4 + 5 + 6 + 7 + 8 + 9
For completeness I should mention that the DAP provided the facility to specify which Processors would carry out an operation, for example
(3) X(x>3)=3
Is an instruction to clip only those Processors that hold a number larger than 3. A Processor holding a 2 will not be affected.
|
|
NEW Languages and Compilers for Parallel Computing: ...
Current Bid: $95.00
|
|
|
NEW Parallel Computing in Structural Engineering - K...
Current Bid: $106.28
|
|
|
NEW Parallel Computing Technologies: 6th Internation...
Current Bid: $89.95
|
|
|
NEW Parallel Computing 1988: Shell Conference, Amste...
Current Bid: $37.95
|
Adding numbers
The DAP allowed a vector view of data. The array could be treated like a chain of 4096 processors.
This allowed the contents of the 4096 PEs to be summed in 12 parallel additions. To see how this could be done look a a chain of 8 Processors holding the values
1 2 3 4 5 6 7 8
first, to each each processor add the one on its right with planar boundary conditions
(1+2 ) (2+3) (3+4) (4 + 5) (5+6) (6 +7) (7+8) (8)
Now to each processor the one two places to the right
(1+2+ 3+4) ( 2+3+ 4 + 5) (3+4 + 5+6) ( 4 + 5 + 5+6) (5+6 +7+8) (6+7 + 7+8) (7+8) (8)
Now to each PE add the one 4 place to the right The first PE now holds
1+2+ 3+4 + 5+6 +7+8
In general N number can be summed in O(Log2(n)) parallel additions.
Interesting Applications
While the DAP was mostly seen as useful only for image based applications like noise removal, optic flow matching and other computer vision algorithms it was used for many other things, for example a colleague developed Prolog for the DAP and others developed sorting algorithms, while I myself developed a fast Fourier Transform algorithm and worked for three years developing Cellular Automaton models of Polymers.
Wrapping up
I have here only given the briefest flavour of the elegant ways in which this architecture can speed up calculations. With all this I think I can be forgiven for considering the recent introduction of very restricted SIMD by Microsoft and others as a pale imitation. Four object not able to communicate with each other indeed.
Or am I missing something in all the VISIO diagrams I see on the web that appear to be deigned to make a simple idea complicated and incomprehensible.
PrintShare it! — Rate it: up down flag this hub









