Parallel Processing & Distributed Systems - Chapter 10: Matrix Multiplication

Sequential matrix multiplication
Algorithms for processor arrays
– Matrix multiplication on 2-D mesh SIMD model
– Matrix multiplication on hypercube SIMD model
Matrix multiplication on UMA
multiprocessors
Matrix multiplication on multicomputers
23 trang | Chuyên mục: Xử Lý Song Song | Chia sẻ: dkS00TYs | Lượt xem: 1370 | Lượt tải: 1
Tóm tắt nội dung Parallel Processing & Distributed Systems - Chapter 10: Matrix Multiplication, để xem tài liệu hoàn chỉnh bạn click vào nút "TẢI VỀ" ở trên
ThoaiNam
-2-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Sequential matrix multiplication
Algorithms for processor arrays
– Matrix multiplication on 2-D mesh SIMD model
– Matrix multiplication on hypercube SIMD model
Matrix multiplication on UMA 
multiprocessors
Matrix multiplication on multicomputers
-3-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Global a[0..l-1,0..m-1], b[0..m-1][0..n-1], {Matrices to be multiplied}
c[0..l-1,0..n-1], {Product matrix}
t, {Accumulates dot product}
i, j, k;
Begin
for i:=0 to l-1 do
for j:=0 to n-1 do
t:=0;
for k:=0 to m-1 do
t:=t+a[i][k]*b[k][j];
endfor k;
c[i][j]:=t;
endfor j;
endfor i; 
End.
-4-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Matrix multiplication on 2-D mesh SIMD model
Matrix multiplication on Hypercube SIMD model
-5-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Gentleman(1978) has shown that multiplication of 
to n*n matrices on the 2-D mesh SIMD model 
requires 0(n) routing steps
We will consider a multiplication algorithm on a 2-
D mesh SIMD model with wraparound
connections
-6-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
For simplicity, we suppose that
– Size of the mesh is n*n
– Size of each matrix (A and B) is n*n
– Each processor Pi,j in the mesh (located at row i,column 
j) contains ai,j and bi,j
At the end of the algorithm, Pi,j will hold the element 
ci,j of the product matrix
-7-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Major phases
(a) Initial distribution 
of matrices A and B
(b) Staggering all A’s elements 
in row i to the left by i positions 
and all B’s elements in col j 
upwards by i positions 
a3,3
b3,3
a3,2
b3,2
a3,1
b3,1
a3,0
b3,0
a2,3
b2,3
a2,2
b2,2
a2,1
b2,1
a2,0
b2,0
a1,3
b1,3
a1,2
b1,2
a1,1
b1,1
a1,0
b1,0
a0,3
b0,3
a0,2
b0,2
a0,1
b0,1
a0,0
b0,0
a3,3
b3,0
a2,3
b3,1
a2,2
b2,0
a1,3
b3,2
a1,2
b2,1
a1,1
b1,0
a0,3
b3,3
a0,2
b2,2
a0,1
b1,1
a0,0
b0,0
a3,2a3,1a3,0
a2,1a2,0
a1,0
b2,3b1,2b0,1
b1,3b0,2
b0,3
-8-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
(c) Distribution of 2 matrices A 
and B after staggering in a 2-D 
mesh with wrapparound 
connection
a3,2
b2,3
a3,1 
b1,2
a3,0
b0,1
a3,3
b3,0
a2,1
b1,3
a2,0
b0,2
a2,3
b3,1
a2,2
b2,0
a1,0
b0,3
a1,3
b3,2
a1,2
b2,1
a1,1
b1,0
a0,3
b3,3
a0,2
b2,2
a0,1
b1,1
a0,0
b0,0
(b) Staggering all A’s elements 
in row i to the left by i positions 
and all B’s elements in col j 
upwards by i positions 
a3,3
b3,0
a2,3
b3,1
a2,2
b2,0
a1,3
b3,2
a1,2
b2,1
a1,1
b1,0
a0,3
b3,3
a0,2
b2,2
a0,1
b1,1
a0,0
b0,0
a3,2a3,1a3,0
a2,1a2,0
a1,0
b2,3b1,2b0,1
b1,3b0,2
b0,3 Each processor P(i,j) has a 
pair of elements to multiply 
ai,k and bk,j
-9-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 The rest steps of the algorithm from the viewpoint of processor P(1,2)
a1,3
b3,2
b2,2
b0,2
b1,2
a1,1 a1,2 a1,0
(a) First scalar multiplication step
a1,0
b0,2
b3,2
b1,2
b2,2
a1,2 a1,3 a1,1
(b) Second scalar multiplication step 
after elements of A are cycled to the 
left and elements of B are cycled 
upwards
-10-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
a1,1
b1,2
b0,2
b2,2
b3,2
a1,3 a1,0 a1,2
(c) Third scalar multiplication step after 
second cycle step
(d) Third scalar multiplication step after 
second cycle step. At this point 
processor P(1,2) has computed the 
dot product c1,2
a1,2
b2,2
b1,2
b3,2
b0,2
a1,0 a1,1 a1,3
-11-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Detailed Algorithm
Global n, {Dimension of matrices}
k ;
Local a, b, c;
Begin
for k:=1 to n-1 do
forall P(i,j) where 1 ≤ i,j < n do
if i ≥ k then a:= fromleft(a); 
if j ≥ k then b:=fromdown(b);
end forall;
endfor k;
Stagger 2 matrices
a[0..n-1,0..n-1] and b[0..n-1,0..n-1]
-12-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
forall P(i,j) where 0 ≤ i,j < n do
c:= a*b;
end forall;
for k:=1 to n-1 do
forall P(i,j) where 0 ≤ i,j < n do
a:= fromleft(a); 
b:=fromdown(b);
c:= c + a*b;
end forall;
endfor k;
End.
Compute dot product
-13-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Can we implement the above mentioned algorithm 
on a 2-D mesh SIMD model without wrapparound 
connection?
-14-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Design strategy 5
– If load balancing is not a problem, maximize grain size
Grain size: the amount of work performed between 
processor interactions
Things to be considered
– Parallelizing the most outer loop of the sequential 
algorithm is a good choice since the attained grain size 
(0(n3/p)) is the biggest
– Resolving memory contention as much as possible
-15-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Algorithm using p processors
Global n, {Dimension of matrices}
a[0..n-1,0..n-1], b[0..n-1,0..n-1]; {Two input matrices}
c[0..n-1,0..n-1]; {Product matrix}
Local i,j,k,t;
Begin
forall Pm where 1 ≤ m ≤ p do
for i:=m to n step p do
for j:= 1 to n to
t:=0;
for k:=1 to n do t:=t+a[i,k]*b[k,j];
endfor j;
c[i][j]:=t; 
endfor i;
end forall;
End.
-16-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Things to be considered
– Try to resolve memory contention as much as possible
– Increase the locality of memory references to reduce 
memory access time
Design strategy 6
– Reduce average memory latency time by increasing 
locality
The block matrix multiplication algorithm is a 
reasonable choice in this situation 
– Section 7.3, p.187, Parallel Computing: Theory and Practice
-17-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
We will study 2 algorithms on multicomputers
– Row-Column-Oriented Algorithm
– Block-Oriented Algorithm
-18-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
The processes are organized as a ring
– Step 1: Initially, each process is given 1 row of the matrix 
A and 1 column of the matrix B
– Step 2: Each process uses vector multiplication to get 1 
element of the product matrix C.
– Step 3: After a process has used its column of matrix B, it 
fetches the next column of B from its successor in the 
ring
– Step 4: If all rows of B have already been processed, 
quit. Otherwise, go to step 2
-19-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Why do we have to organize processes as a ring 
and make them use B’s rows in turn?
Design strategy 7:
– Eliminate contention for shared resources by changing 
the order of data access
-20-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
A B C
A B C A B C
A B C
Example: Use 4 processes to mutliply two matrices A4*4 and B4*4
1st iteration
-21-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
A B C
A B C A B C
A B C
Example: Use 4 processes to mutliply two matrices A4*4 and B4*4
2nd iteration
-22-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
A B C
A B C A B C
A B C
Example: Use 4 processes to mutliply two matrices A4*4 and B4*4
3rd iteration
-23-Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
A B C
A B C A B C
A B C
Example: Use 4 processes to mutliply two matrices A4*4 and B4*4
4th iteration 
(the last)
File đính kèm:
Parallel Processing & Distributed Systems - Chapter 10 Matrix Multiplication.pdf