Parallel Processing & Distributed Systems - Chapter 3: Parallel Computer Architectures

Based on notions of instruction and data streams
–SISD(Single Instruction stream, a Single Data stream )
–SIMD(Single Instruction stream, Multiple Data streams )
–MISD(Multiple Instruction streams, a Single Data stream)
–MIMD(Multiple Instruction streams, Multiple Data stream)
Popularity
–MIMD> SIMD> MISD
37 trang | Chuyên mục: Xử Lý Song Song | Chia sẻ: dkS00TYs | Lượt xem: 1599 | Lượt tải: 2
Tóm tắt nội dung Parallel Processing & Distributed Systems - Chapter 3: Parallel Computer Architectures, để xem tài liệu hoàn chỉnh bạn click vào nút "TẢI VỀ" ở trên
ons of instruction and data streams
– SISD (Single Instruction stream, a Single Data stream )
– SIMD (Single Instruction stream, Multiple Data streams )
– MISD (Multiple Instruction streams, a Single Data stream)
– MIMD (Multiple Instruction streams, Multiple Data stream)
 Popularity
– MIMD > SIMD > MISD
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 SISD 
– Conventional sequential machines
IS : Instruction Stream DS : Data Stream
CU : Control Unit PU : Processing Unit
MU : Memory Unit
CU PU MU
IS
IS DS
I/O
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 SIMD
– Vector computers, processor arrays
– Special purpose computations
PE : Processing Element LM : Local Memory
CU
PE1 LM1
PEn LMn
DS
DS
DS
DS
IS IS
Program loaded 
from host
Data sets 
loaded from host
SIMD architecture with distributed memory
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 MISD 
– Systolic arrays
– Special purpose computations
Memory
(Program,
Data) PU1 PU2 PUn
CU1 CU2 CUn
DS DS DS
IS IS
IS IS
IS
DS
I/O
MISD architecture (the systolic array)
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 MIMD
– General purpose parallel computers
CU1 PU1
Shared
Memory
IS
IS DS
I/O
CUn PUn
IS DSI/O
IS
MIMD architecture with shared memory
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Classification based on 
Architecture
 Pipelined Computers
 Dataflow Architectures
 Data Parallel Systems
 Multiprocessors
 Multicomputers
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 Instructions are divided into a number of steps 
(segments, stages)
 At the same time, several instructions can be 
loaded in the machine and be executed in 
different steps
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
– IF – instruction fetch
– ID – instruction decode and register fetch
– EX- execution and effective address calculation
– MEM – memory access
– WB- write back
Instruction i IF ID EX WB
IF ID EX WB
IF ID EX WB
IF ID EX WB
IF ID EX WB
Instruction i+1
Instruction i+2
Instruction i+3
Instruction # 1 2 3 4 5 6 7 8
Cycles
MEM
MEM
MEM
MEM
MEM
9
Instruction i+4
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 Data-driven model
– A program is represented as a directed acyclic graph in 
which a node represents an instruction and an edge 
represents the data dependency relationship between the 
connected nodes
– Firing rule
» A node can be scheduled for execution if and only if its 
input data become valid for consumption
 Dataflow languages
– Id, SISAL, Silage, LISP,...
– Single assignment, applicative(functional) language
– Explicit parallelism
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
z = (a + b) * c
+
*
a
b
c
z
The dataflow representation of an arithmetic expression
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 Execution of instructions is driven by data availability
– What is the difference between this and normal (control 
flow) computers?
 Advantages
– Very high potential for parallelism
– High throughput 
– Free from side-effect
 Disadvantages
– Time lost waiting for unneeded arguments
– High control overhead
– Difficult in manipulating data structures
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
input d,e,f
c0 = 0
for i from 1 to 4 
do
begin
ai := di / ei
bi := ai * fi
ci := bi + ci-1
end
output a, b, c
/
d1 e1
*
+
f1
/
d2 e2
*
+
f2
/
d3 e3
*
+
f3
/
d4 e4
*
+
f4
a1
b1
a2
b2
a3
b3
a4
b4
c0 c4
c1 c2 c3
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Execution on 
a Control Flow Machine
a1 b1 c1 a2 b2 c2 a4 b4 c4
Sequential execution on a uniprocessor in 24 cycles
Assume all the external inputs are available before entering do loop
+ : 1 cycle, * : 2 cycles, / : 3 cycles, 
How long will it take to execute this program on a dataflow 
computer with 4 processors?
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Execution on 
a Dataflow Machine
c1 c2 c3 c4a1
a2
a3
a4
b1
b2
b4
b3
Data-driven execution on a 4-processor dataflow computer in 9 cycles
Can we further reduce the execution time of this program ?
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 Programming model 
– Operations performed in parallel on each element 
of data structure
– Logically single thread of control, performs 
sequential or parallel steps
– Conceptually, a processor associated with each 
data element
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 SIMD Architectural model
– Array of many simple, cheap processors with little 
memory each
» Processors don’t sequence through instructions
– Attached to a control processor that issues 
instructions
– Specialized and general communication, cheap global 
synchronization
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 Instruction set includes operations on vectors 
as well as scalars
 2 types of vector computers
– Processor arrays
– Pipelined vector processors
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 A sequential computer connected with a set of identical processing 
elements simultaneouls doing the same operation on different data. 
Eg CM-200
Data path
Processing 
element Data 
memory
I
n
t
e
r
c
o
n
n
e
c
t
i
o
n
n
e
t
w
o
r
k
Processing 
element Data 
memory
Processing 
element Data 
memory
I/0
Processor array
Instruction 
path
Program and 
Data Memory
CPU
I/O processor
Front-end computer
I/0
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 Stream vector from 
memory to the CPU
 Use pipelined 
arithmetic units to 
manipulate data
 Eg: Cray-1, Cyber-205
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 Consists of many fully programmable 
processors each capable of executing its own 
program
 Shared address space architecture
 Classified into 2 types
– Uniform Memory Access (UMA) Multiprocessors
– Non-Uniform Memory Access (NUMA) 
Multiprocessors
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 Uses a central switching mechanism to reach a centralized 
shared memory
 All processors have equal access time to global memory
 Tightly coupled system
 Problem: cache consistency
P1
Switching mechanism
I/O
P2 … Pn
Memory banks
C1 C2 Cn
Pi
Ci
Processor i
Cache i
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 Crossbar switching mechanism
Mem
Mem
Mem
Mem
Cache
P
I/OCache
P
I/O
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 Shared-bus switching mechanism
Mem Mem Mem Mem
Cache
P
I/OCache
P
I/O
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 Packet-switched network
Network
Mem Mem
Cache
P
Cache
P
Cache
P
Mem
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 Distributed shared 
memory combined 
by local memory of 
all processors
 Memory access 
time depends on 
whether it is local 
to the processor
 Caching shared 
(particularly 
nonlocal) data?
Network
Cache
P
Mem Cache
P
Mem
Cache
P
Mem Cache
P
Mem
Distributed Memory
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Current Types of Multiprocessors
 PVP (Parallel Vector Processor)
– A small number of proprietary vector processors 
connected by a high-bandwidth crossbar switch
 SMP (Symmetric Multiprocessor)
– A small number of COST microprocessors connected by a 
high-speed bus or crossbar switch
 DSM (Distributed Shared Memory)
– Similar to SMP
– The memory is physically distributed among nodes.
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
VP VP
Crossbar Switch
VP
SM SM SM
VP : Vector Processor
SM : Shared Memory
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
SMP (Symmetric Multi-Processor)
P/C P/C
Bus or Crossbar Switch
P/C
SM SM SM
P/C : Microprocessor and Cache
SM: Shared Memory
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
DSM (Distributed Shared Memory)
Custom-Designed Network
MB MB
P/C
LM
NIC
DIR
P/C
LM
NIC
DIR
MB: Memory Bus
P/C: Microprocessor & Cache
LM: Local Memory
DIR: Cache Directory
NIC: Network Interface 
Circuitry
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
 Consists of many processors with their own memory
 No shared memory
 Processors interact via message passing  loosely 
coupled system
P/C P/C
Message-passing 
Interconnection Network
P/C
M M M
P/C: Microprocessor & Cache
M: Memory
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Current Types of Multicomputers
MPP (Massively Parallel Processing)
– Total number of processors > 1000
 Cluster
– Each node in system has less than 
16 processors.
 Constellation
– Each node in system has more than 
16 processors
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
MPP 
(Massively Parallel Processing)
P/C
LM
NIC
MB
P/C
LM
NIC
MB
Custom-Designed Network
P/C: Microprocessor & Cache MB: Memory Bus 
NIC: Network Interface Circuitry LM: Local Memory
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Commodity Network (Ethernet, ATM, Myrinet, VIA)
MB MB
P/C
M
NIC
P/C
M
Bridge Bridge
LD LD
NIC
IOB IOB
MB: Memory Bus
P/C: Microprocessor & 
Cache
M: Memory
LD: Local Disk
IOB: I/O Bus
NIC: Network Interface 
Circuitry
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
P/CP/C
SM SM
NICLD
Hub
Custom or Commodity Network
>= 16
IOC
P/CP/C
SM SM
NICLD
Hub
>= 16
IOC
P/C: Microprocessor & Cache MB: Memory Bus
NIC: Network Interface Circuitry SM: Shared Memory
IOC: I/O Controller LD: Local Disk
Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Trend in Parallel Computer 
Architectures
0
50
100
150
200
250
300
350
400
1997 1998 1999 2000 2001 2002
Years
MPPs Constellations Clusters SMPs
File đính kèm:
Parallel Processing & Distributed Systems - Chapter 3 Parallel Computer Architectures.pdf