Parallel Processing & Distributed Systems - Chapter 9: Parallel Algorithms
Introduction to parallel algorithms
development
Reduction algorithms
Broadcast algorithms
Prefix sums algorithms
Tóm tắt nội dung Parallel Processing & Distributed Systems - Chapter 9: Parallel Algorithms, để xem tài liệu hoàn chỉnh bạn click vào nút "TẢI VỀ" ở trên
tion – Broadcast – Prefix sums Target Architectures – Hypercube SIMD model – 2D-mesh SIMD model – UMA multiprocessor model – Hypercube Multicomputer -5-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Description: Given n values a0, a1, a2…an-1, an associative operation ⊕, let’s use p processors to compute the sum: S = a0 ⊕ a1 ⊕ a2 ⊕… ⊕ an-1 Design strategy 1 – “If a cost optimal CREW PRAM algorithms exists and the way the PRAM processors interact through shared variables maps onto the target architecture, a PRAM algorithm is a reasonable starting point” -6-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM a0 j=0 j=1 j=2 a1 a2 a3 a4 a5 a6 a7 P0 P0 P0 P1 P2 P3 P2 Cost optimal PRAM algorithm complexity: O(logn) (using n div 2 processors) Example for n=8 and p=4 processors -7-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Cost Optimal PRAM Algorithm for the Reduction Problem(cont’d) Using p= n div 2 processors to add n numbers: Global a[0..n-1], n, i, j, p; Begin spawn(P0, P1,… ,,Pp-1); for all Pi where 0 ≤ i ≤ p-1 do for j=0 to ceiling(logp)-1 do if i mod 2j =0 and 2i + 2j < n then a[2i] := a[2i] ⊕ a[2i + 2j]; endif; endfor j; endforall; End. Notes: the processors communicate in a biominal-tree pattern -8-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Solving Reducing Problem on Hypercube SIMD Computer P0 P1 P3 P2 P4 P5 P6 P7 Step 1: Reduce by dimension j=2 Step 2: Reduce by dimension j=1 P0 P1 P3 P2 P0 P1 Step 3: Reduce by dimension j=0 The total sum will be at P0 -9-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Solving Reducing Problem on Hypercube SIMD Computer (cond’t) Using p processors to add n numbers ( p << n) Global j; Local local.set.size, local.value[1..n div p +1], sum, tmp; Begin spawn(P0, P1,… ,,Pp-1); for all Pi where 0 ≤ i ≤ p-1 do if (i < n mod p) then local.set.size:= n div p + 1 else local.set.size := n div p; endif; sum[i]:=0; endforall; Allocate workload for each processors -10-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Solving Reducing Problem on Hypercube SIMD Computer (cond’t) for j:=1 to (n div p +1) do for all Pi where 0 ≤ i ≤ p-1 do if local.set.size ≥ j then sum[i]:= sum ⊕ local.value [j]; endforall; endfor j; Calculate the partial sum for each processor -11-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Solving Reducing Problem on Hypercube SIMD Computer (cond’t) for j:=ceiling(logp)-1 downto 0 do for all Pi where 0 ≤ i ≤ p-1 do if i < 2j then tmp := [i + 2j]sum; sum := sum ⊕ tmp; endif; endforall; endfor j; Calculate the total sum by reducing for each dimension of the hypercube -12-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM A 2D-mesh with p*p processors need at least 2(p-1) steps to send data between two farthest nodes The lower bound of the complexity of any reduction sum algorithm is 0(n/p2 + p) Example: a 4*4 mesh need 2*3 steps to get the subtotals from the corner processors -13-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Solving Reducing Problem on 2D-Mesh SIMD Computer(cont’d) Example: compute the total sum on a 4*4 mesh Stage 1 Step i = 3 Stage 1 Step i = 2 Stage 1 Step i = 1 -14-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Solving Reducing Problem on 2D-Mesh SIMD Computer(cont’d) Example: compute the total sum on a 4*4 mesh Stage 2 Step i = 3 Stage 2 Step i = 2 Stage 2 Step i = 1 (the sum is at P1,1) -15-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Solving Reducing Problem on 2D-Mesh SIMD Computer(cont’d) Summation (2D-mesh SIMD with l*l processors Global i; Local tmp, sum; Begin {Each processor finds sum of its local value code not shown} for i:=l-1 downto 1 do for all Pj,i where 1 ≤ i ≤ l do {Processing elements in colum i active} tmp := right(sum); sum:= sum ⊕ tmp; end forall; endfor; Stage 1: Pi,1 computes the sum of all processors in row i-th -16-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Solving Reducing Problem on 2D-Mesh SIMD Computer(cont’d) for i:= l-1 downto 1 do for all Pi,1 do {Only a single processing element active} tmp:=down(sum); sum:=sum ⊕ tmp; end forall; endfor; End. Stage2: Compute the total sum and store it at P1,1 -17-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Solving Reducing Problem on UMA Multiprocessor Model(MIMD) Easily to access data like PRAM Processors execute asynchronously, so we must ensure that no processor access an “unstable” variable Variables used: Global a[0..n-1], {values to be added} p, {number of proeessor, a power of 2} flags[0..p-1], {Set to 1 when partial sum available} partial[0..p-1], {Contains partial sum} global_sum; {Result stored here} Local local_sum; -18-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Solving Reducing Problem on UMA Multiprocessor Model(cont’d) Example for UMA multiprocessor with p=8 processors P0 P1 P2 P3 P4 P5 P6 P7 Step j=8 Stage 2 Step j=4 Step j=2 Step j=1 The total sum is at P0 -19-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Solving Reducing Problem on UMA Multiprocessor Model(cont’d) Summation (UMA multiprocessor model) Begin for k:=0 to p-1 do flags[k]:=0; for all Pi where 0 ≤ i < p do local_sum :=0; for j:=i to n-1 step p do local_sum:=local_sum⊕ a[j]; Stage 1: Each processor computes the partial sum of n/p values -20-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Solving Reducing Problem on UMA Multiprocessor Model(cont’d) j:=p; while j>0 do begin if i ≥ j/2 then partial[i]:=local_sum; flags[i]:=1; break; else while (flags[i+j/2]=0) do; local_sum:=local_sum⊕ partial[i+j/2]; endif; j=j/2; end while; if i=0 then global_sum:=local_sum; end forall; End. Each processor waits for the partial sum of its partner available Stage 2: Compute the total sum -21-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Solving Reducing Problem on UMA Multiprocessor Model(cont’d) Algorithm complexity 0(n/p+p) What is the advantage of this algorithm compared with another one using critical-section style to compute the total sum? Design strategy 2: – Look for a data-parallel algorithm before considering a control-parallel algorithm On MIMD computer, we should exploit both data parallelism and control parallelism (try to develop SPMD program if possible) -22-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Description: – Given a message of length M stored at one processor, let’s send this message to all other processors Things to be considered: – Length of the message – Message passing overhead and data-transfer time -23-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM If the amount of data is small, the best algorithm takes logp communication steps on a p-node hypercube Examples: broadcasting a number on a 8-node hypercube P0 P1 P3 P2 P4 P5 P6 P7 Step 3: Send the number via the 3rd dimension of the hypercube Step 2: Send the number via the 2nd dimension of the hypercube P0 P1 P3 P2 P0 P1 Step 1: Send the number via the 1st dimension of the hypercube -24-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Broadcasting a number from P0 to all other processors Local i, {Loop iteration} p, {Partner processor} position; {Position in broadcast tree} value; {Value to be broadcast} Begin spawn(P0, P1,… ,,Pp-1); for j:=0 to logp-1 do for all Pi where 0 ≤ i ≤ p-1 do if i < 2j then partner := i+2j; [partner]value:=value; endif; endforall; end forj; End. -25-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM The previous algorithm – Uses at most p/2 out of plogp links of the hypercube – Requires time Mlogp to broadcast a length M msg not efficient to broadcast long messages Johhsson and Ho (1989) have designed an algorithm that executes logp times faster by: – Breaking the message into logp parts – Broadcasting each parts to all other nodes through a different biominal spanning tree -26-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Time to broadcast a msg of length M is Mlogp/logp = M The maximum number of links used simultaneously is plogp, much greater than that of the previous algorithm A B C C A B A B C B C A C A B A A B B C C -27-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Johnsson and Ho’s Broadcast Algorithm on Hypercube SIMD(cont’d) Design strategy 3 – As problem size grow, use the algorithm that makes best use of the available resources -28-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Description: – Given an associative operation ⊕ and an array A containing n elements, let’s compute the n quantities A[0] A[0] ⊕ A[1] A[0] ⊕ A[1] ⊕ A[2] … A[0] ⊕ A[1] ⊕ A[2] ⊕… ⊕ A[n-1] Cost-optimal PRAM algorithm: – ”Parallel Computing: Theory and Practice”, section 2.3.2, p. 32 -29-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Finding the prefix sums of 16 values Processor 0 3 2 7 6 18 18 35 43 62 3 5 12 18 (a) (b) (c) (d) Processor 1 0 5 4 8 17 18 35 43 62 18 23 27 35 Processor 2 2 0 1 5 8 18 35 43 62 37 37 38 43 Processor 3 2 3 8 6 19 18 35 43 62 45 48 56 62 -30-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Step (a) – Each processor is allocated with its share of values Step (b) – Each processor computes the sum of its local elements Step (c) – The prefix sums of the local sums are computed and distributed to all processor Step (d) – Each processor computes the prefix sum of its own elements and adds to each result the sum of the values held in lower-numbered processors
File đính kèm:
- Parallel Processing & Distributed Systems - Chapter 9 Parallel Algorithms.pdf