Parallel Processing & Distributed Systems - Chapter 2: Parallel Computer Models & Classification
An abstract machine model is mainly used in
the design and analysis of parallel algorithms
without worry about the details of physics
machines.
Three abstract machine models:
–PRAM
–BSP
– Phase Parallel
ThoaiNam Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Abstract Machine Models: – PRAM, BSP, Phase Parallel Pipeline, Processor Array, Multiprocessor, Data Flow Computer Flynn Classification: – SISD, SIMD, MISD, MIMD Pipeline Computer ThoaiNam Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM An abstract machine model is mainly used in the design and analysis of parallel algorithms without worry about the details of physics machines. Three abstract machine models: – PRAM – BSP – Phase Parallel Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM RAM (random access machine) … Memory Program Location counter r0 r1 r2 r3 …x2x1 xn…x2x1 Write-only output tape Read-only input tape Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Global memory Private memory Private memory Private memory Parallel random-access machine … P1 … P2 … Pn … Interconnection network … Control Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM A control unit An unbounded set of processors, each with its own private memory and an unique index Input stored in global memory or a single active processing element Step: (1) read a value from a single private/global memory location (2) perform a RAM operation (3) write into a single private/global memory location During a computation step: a processor may activate another processor All active, enable processors must execute the same instruction (albeit on different memory location) Computation terminates when the last processor halts Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Definition: The cost of a PRAM computation is the product of the parallel time complexity and the number of processors used. Ex: a PRAM algorithm that has time complexity O(log p) using p processors has cost O(p log p) Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Time complexity of a PRAM algorithm is often expressed in the big-O notation Machine size n is usually small in existing parallel computers Ex: – Three PRAM algorithms A, B and C have time complexities if 7n, (n log n)/4, n log log n. – Big-O notation: A(O(n)) < C(O(n log log n)) < B(O(n log n)) – Machines with no more than 1024 processors: log n ≤ log 1024 = 10 and log log n ≤ log log 1024 < 4 and thus: B < C < A Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM PRAM execution can result in simultaneous access to the same location in shared memory. – Exclusive Read (ER) » No two processors can simultaneously read the same memory location. – Exclusive Write (EW) » No two processors can simultaneously write to the same memory location. – Concurrent Read (CR) » Processors can simultaneously read the same memory location. – Concurrent Write (CW) » Processors can simultaneously write to the same memory location, using some conflict resolution scheme. Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Common/Identical CRCW – All processors writing to the same memory location must be writing the same value. – The software must ensure that different values are not attempted to be written. Arbitrary CRCW – Different values may be written to the same memory location, and an arbitrary one succeeds. Priority CRCW – An index is associated with the processors and when more than one processor write occurs, the lowest-numbered processor succeeds. – The hardware must resolve any conflicts Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Begin with a single active processor active Two phases: – A sufficient number of processors are activated – These activated processors perform the computation in parallel log p activation steps: p processors to become active The number of active processors can be double by executing a single instruction Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM 3650192834 9510107 91517 932 41 Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM (EREW PRAM Algorithm in Figure2-7, page 32, book [1]) Ex: SUM(EREW) Initial condition: List of n ≥ 1 elements stored in A[0..(n-1)] Final condition: Sum of elements stored in A[0] Global variables: n, A[0..(n-1)], j begin spawn (P0, P1,…, Pn/2 -1) for all Pi where 0 ≤ i ≤ n/2 -1 do for j← 0 to log n – 1 do if i modulo 2j = 0 and 2i+2j < n the A[2i] ← A[2i] + A[2i+2j] endif endfor endfor end Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM BSP – Bulk Synchronous Parallel BSP Model – Proposed by Leslie Valiant of Harvard University – Developed by W.F.McColl of Oxford University Communication Network (g) P M P M P M Node (w) Node Node Barrier (l) Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM A set of n nodes (processor/memory pairs) Communication Network – Point-to-point, message passing (or shared variable) Barrier synchronizing facility – All or subset Distributed memory architecture Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM A BSP program: – n processes, each residing on a node – Executing a strict sequence of supersteps – In each superstep, a process executes: »Computation operations: w cycles »Communication: gh cycles » Barrier synchronization: l cycles Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM The basic time unit is a cycle (or time step) w parameter – Maximum computation time within each superstep – Computation operation takes at most w cycles. g parameter – Number of cycles for communication of unit message when all processors are involved in communication - network bandwidth – (total number of local operations performed by all processors in one second) / (total number of words delivered by the communication network in one second) – h relation coefficient – Communication operation takes gh cycles. l parameter – Barrier synchronization takes l cycles. Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Superstep 1 Superstep 2 Barrier P1 P2 P3 P4 Computation Communication Barrier Computation Communication Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Time Complexity of BSP Algorithms Execution time of a superstep: – Sequence of the computation, the communication, and the synchronization operations: w + gh + l – Overlapping the computation, the communication, and the synchronization operations: max{w, gh, l} Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Proposed by Kai Hwang & Zhiwei Xu Similar to the BSP: – A parallel program: sequence of phases – Next phase cannot begin until all operations in the current phase have finished – Three types of phases: » Parallelism phase: the overhead work involved in process management, such as process creation and grouping for parallel processing » Computation phase: local computation (data are available) » Interaction phase: communication, synchronization or aggregation (e.g., reduction and scan) Different computation phases may execute different workloads at different speed. Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM A parallel machine model (also known as programming model, type architecture, conceptual model, or idealized model) is an abstract parallel computer from programmer‘s viewpoint, analogous to the von Neumann model for sequential computing. The abstraction need not imply any structural information, such as the number of processors and interprocessor communication structure, but it should capture implicitly the relative costs of parallel computation. Every parallel computer has a native model that closely reflects ist own architecture. Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Five semantic attributes – Honogeneity – Synchrony – Interaction mechanism – Address space – Memory model Several performance attributes – Machine size – Clock rate – Workload – Speedup, efficiency, utilization – Startup time …
File đính kèm:
- Parallel Processing & Distributed Systems - Chapter 2 Parallel Computer Models & Classification.pdf