Parallel Processing & Distributed Systems - Chapter 9: Parallel Algorithms

Introduction to parallel algorithms
development
Reduction algorithms
Broadcast algorithms
Prefix sums algorithms
30 trang | Chuyên mục: Xử Lý Song Song | Chia sẻ: dkS00TYs | Lượt xem: 1494 | Lượt tải: 2
Tóm tắt nội dung Parallel Processing & Distributed Systems - Chapter 9: Parallel Algorithms, để xem tài liệu hoàn chỉnh bạn click vào nút "TẢI VỀ" ở trên
tion
– Broadcast
– Prefix sums
 Target Architectures 
– Hypercube SIMD model
– 2D-mesh SIMD model
– UMA multiprocessor model
– Hypercube Multicomputer
-5-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 Description: Given n values a0, a1, a2…an-1, an 
associative operation ⊕, let’s use p processors 
to compute the sum:
S = a0 ⊕ a1 ⊕ a2 ⊕… ⊕ an-1
 Design strategy 1
– “If a cost optimal CREW PRAM algorithms exists
and the way the PRAM processors interact through 
shared variables maps onto the target architecture, a 
PRAM algorithm is a reasonable starting point”
-6-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
a0
j=0
j=1
j=2
a1 a2 a3 a4 a5 a6 a7
P0
P0
P0
P1 P2 P3
P2
 Cost optimal PRAM algorithm complexity: 
O(logn) (using n div 2 processors) 
 Example for n=8 and p=4 processors
-7-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Cost Optimal PRAM Algorithm for 
the Reduction Problem(cont’d)
Using p= n div 2 processors to add n numbers:
Global a[0..n-1], n, i, j, p;
Begin
spawn(P0, P1,… ,,Pp-1);
for all Pi where 0 ≤ i ≤ p-1 do
for j=0 to ceiling(logp)-1 do
if i mod 2j =0 and 2i + 2j < n then
a[2i] := a[2i] ⊕ a[2i + 2j];
endif; 
endfor j;
endforall;
End.
Notes: the processors communicate in a biominal-tree pattern
-8-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Solving Reducing Problem on 
Hypercube SIMD Computer
P0
P1 P3
P2
P4
P5
P6
P7
Step 1: 
Reduce by dimension j=2
Step 2: 
Reduce by dimension j=1
P0
P1 P3
P2
P0
P1
Step 3: 
Reduce by dimension j=0
The total sum will be at P0
-9-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Solving Reducing Problem on 
Hypercube SIMD Computer (cond’t)
Using p processors to add n numbers ( p << n)
Global j;
Local local.set.size, local.value[1..n div p +1], sum, 
tmp;
Begin
spawn(P0, P1,… ,,Pp-1);
for all Pi where 0 ≤ i ≤ p-1 do
if (i < n mod p) then local.set.size:= n div p + 1
else local.set.size := n div p;
endif;
sum[i]:=0;
endforall;
Allocate 
workload for 
each 
processors
-10-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Solving Reducing Problem on 
Hypercube SIMD Computer (cond’t)
for j:=1 to (n div p +1) do
for all Pi where 0 ≤ i ≤ p-1 do
if local.set.size ≥ j then 
sum[i]:= sum ⊕ local.value [j];
endforall;
endfor j;
Calculate the 
partial sum for 
each processor
-11-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Solving Reducing Problem on 
Hypercube SIMD Computer (cond’t)
for j:=ceiling(logp)-1 downto 0 do
for all Pi where 0 ≤ i ≤ p-1 do
if i < 2j then
tmp := [i + 2j]sum;
sum := sum ⊕ tmp;
endif;
endforall;
endfor j;
Calculate the total 
sum by reducing 
for each 
dimension of the 
hypercube
-12-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 A 2D-mesh with p*p processors need at least 2(p-1) steps to 
send data between two farthest nodes
 The lower bound of the complexity of any reduction sum 
algorithm is 0(n/p2 + p)
Example: a 4*4 mesh 
need 2*3 steps to get 
the subtotals from the 
corner processors
-13-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Solving Reducing Problem on 
2D-Mesh SIMD Computer(cont’d)
 Example: compute the total sum on a 4*4 mesh
Stage 1
Step i = 3
Stage 1
Step i = 2
Stage 1
Step i = 1
-14-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Solving Reducing Problem on 
2D-Mesh SIMD Computer(cont’d)
 Example: compute the total sum on a 4*4 mesh
Stage 2
Step i = 3
Stage 2
Step i = 2
Stage 2
Step i = 1
(the sum is at P1,1)
-15-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Solving Reducing Problem on 
2D-Mesh SIMD Computer(cont’d)
Summation (2D-mesh SIMD with l*l processors
Global i;
Local tmp, sum;
Begin
{Each processor finds sum of its local value 
code not shown} 
for i:=l-1 downto 1 do
for all Pj,i where 1 ≤ i ≤ l do
{Processing elements in colum i active}
tmp := right(sum);
sum:= sum ⊕ tmp;
end forall;
endfor;
Stage 1: 
Pi,1 computes 
the sum of all 
processors in 
row i-th
-16-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Solving Reducing Problem on 
2D-Mesh SIMD Computer(cont’d)
for i:= l-1 downto 1 do
for all Pi,1 do
{Only a single processing element active}
tmp:=down(sum);
sum:=sum ⊕ tmp;
end forall;
endfor;
End.
Stage2:
Compute the 
total sum and 
store it at P1,1
-17-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Solving Reducing Problem on 
UMA Multiprocessor Model(MIMD)
 Easily to access data like PRAM
 Processors execute asynchronously, so we must ensure 
that no processor access an “unstable” variable
 Variables used: 
Global a[0..n-1], {values to be added}
p, {number of proeessor, a power of 2}
flags[0..p-1], {Set to 1 when partial sum available}
partial[0..p-1], {Contains partial sum}
global_sum; {Result stored here}
Local local_sum;
-18-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Solving Reducing Problem on 
UMA Multiprocessor Model(cont’d)
 Example for UMA multiprocessor with p=8 processors
P0 P1 P2 P3 P4 P5 P6 P7
Step j=8
Stage 2
Step j=4
Step j=2
Step j=1 The total sum is at P0
-19-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Solving Reducing Problem on UMA 
Multiprocessor Model(cont’d)
Summation (UMA multiprocessor model)
Begin
for k:=0 to p-1 do flags[k]:=0;
for all Pi where 0 ≤ i < p do
local_sum :=0;
for j:=i to n-1 step p do 
local_sum:=local_sum⊕ a[j];
Stage 1: 
Each processor 
computes the 
partial sum of n/p
values
-20-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Solving Reducing Problem on UMA 
Multiprocessor Model(cont’d)
j:=p; 
while j>0 do begin
if i ≥ j/2 then
partial[i]:=local_sum;
flags[i]:=1;
break;
else
while (flags[i+j/2]=0) do;
local_sum:=local_sum⊕ partial[i+j/2];
endif;
j=j/2;
end while;
if i=0 then global_sum:=local_sum;
end forall;
End.
Each processor 
waits for the partial 
sum of its partner 
available
Stage 2:
Compute the total sum
-21-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Solving Reducing Problem on UMA 
Multiprocessor Model(cont’d)
 Algorithm complexity 0(n/p+p)
 What is the advantage of this algorithm compared 
with another one using critical-section style to 
compute the total sum?
 Design strategy 2:
– Look for a data-parallel algorithm before considering a 
control-parallel algorithm
 On MIMD computer, we should exploit both data 
parallelism and control parallelism 
(try to develop SPMD program if possible)
-22-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 Description:
– Given a message of length M stored at one processor, 
let’s send this message to all other processors 
 Things to be considered:
– Length of the message
– Message passing overhead and data-transfer time
-23-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 If the amount of data is small, the best algorithm takes logp
communication steps on a p-node hypercube
 Examples: broadcasting a number on a 8-node hypercube
P0
P1 P3
P2
P4
P5
P6
P7
Step 3: 
Send the number via the 
3rd dimension of the 
hypercube
Step 2: 
Send the number via the 
2nd dimension of the 
hypercube
P0
P1 P3
P2
P0
P1
Step 1: 
Send the number via the 
1st dimension of the 
hypercube
-24-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Broadcasting a number from P0 to all other processors
Local i, {Loop iteration}
p, {Partner processor}
position; {Position in broadcast tree}
value; {Value to be broadcast}
Begin
spawn(P0, P1,… ,,Pp-1);
for j:=0 to logp-1 do
for all Pi where 0 ≤ i ≤ p-1 do
if i < 2j then
partner := i+2j;
[partner]value:=value;
endif;
endforall;
end forj;
End.
-25-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 The previous algorithm 
– Uses at most p/2 out of plogp links of the hypercube
– Requires time Mlogp to broadcast a length M msg
not efficient to broadcast long messages
 Johhsson and Ho (1989) have designed an 
algorithm that executes logp times faster by:
– Breaking the message into logp parts
– Broadcasting each parts to all other nodes through a 
different biominal spanning tree
-26-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 Time to broadcast a msg of length M is Mlogp/logp = M
 The maximum number of links used simultaneously is 
plogp, much greater than that of the previous algorithm
A
B
C
C
A
B
A
B
C B
C
A
C
A
B
A
A
B B
C
C
-27-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Johnsson and Ho’s Broadcast Algorithm 
on Hypercube SIMD(cont’d)
Design strategy 3
– As problem size grow, use the algorithm that 
makes best use of the available resources
-28-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 Description:
– Given an associative operation ⊕ and an array A 
containing n elements, let’s compute the n quantities
 A[0]
 A[0] ⊕ A[1]
 A[0] ⊕ A[1] ⊕ A[2]
 …
 A[0] ⊕ A[1] ⊕ A[2] ⊕… ⊕ A[n-1]
 Cost-optimal PRAM algorithm: 
– ”Parallel Computing: Theory and Practice”, section 2.3.2, p. 32
-29-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 Finding the prefix sums of 16 values
Processor 0
3 2 7 6
18
18 35 43 62
3 5 12 18
(a)
(b)
(c)
(d)
Processor 1
0 5 4 8
17
18 35 43 62
18 23 27 35
Processor 2
2 0 1 5
8
18 35 43 62
37 37 38 43
Processor 3
2 3 8 6
19
18 35 43 62
45 48 56 62
-30-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 Step (a)
– Each processor is allocated with its share of values
 Step (b)
– Each processor computes the sum of its local elements
 Step (c)
– The prefix sums of the local sums are computed and 
distributed to all processor
 Step (d)
– Each processor computes the prefix sum of its own 
elements and adds to each result the sum of the values 
held in lower-numbered processors
File đính kèm:
Parallel Processing & Distributed Systems - Chapter 9 Parallel Algorithms.pdf