Computer Architecture A Quantitative Approach, Fifth Edition Chapter

Computer Architecture A Quantitative Approach, Fifth Edition Chapter

Computer Architecture A Quantitative Approach, Fifth Edition Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Copyright 2012, Elsevier Inc. All rights reserved. 1 Flynn taxonomy Single instruction stream single data stream (SISD) Single instruction stream, multiple data streams (SIMD) Multiple instruction streams, single data stream (MISD) Vector processor, Multimedia extensions on modern processors

Stream processors (somewhat) Exotic fault tolerance architectures (eg Space Shuttle) Multiple instruction streams, multiple data streams (MIMD) Multi-core systems Multi-processor systems Distributed systems Copyright 2012, Elsevier Inc. All rights reserved. 2 SIMD architectures can exploit significant datalevel parallelism for: Introduction Introduction matrix-oriented scientific computing

media-oriented image and sound processors SIMD allows programmer to continue to think sequentially Copyright 2012, Elsevier Inc. All rights reserved. 3 VECTOR ARCHITECTURES Copyright 2012, Elsevier Inc. All rights reserved. 4 Basic idea: Read sets of data elements into vector registers Operate on those registers Disperse the results back into memory Vector Architectures

Vector Architectures Registers are controlled by compiler Used to hide memory latency Leverage memory bandwidth Copyright 2012, Elsevier Inc. All rights reserved. 5 Example (fictional) architecture: VMIPS Loosely based on Cray-1 Vector registers

Fully pipelined Data and control hazards are detected Vector load-store unit Each register holds a 64-element, 64 bits/element vector Register file has 16 read ports and 8 write ports Vector functional units Vector Architectures VMIPS Fully pipelined One word per clock cycle after initial latency Scalar registers 32 general-purpose registers

32 floating-point registers Copyright 2012, Elsevier Inc. All rights reserved. 6 ADDVV.D: add two vectors ADDVS.D: add vector to a scalar LV/SV: vector load and vector store from address Vector Architectures VMIPS Instructions Example: DAXPY (Y = a * X + Y) L.D F0,a ; load scalar a LV V1,Rx ; load vector X MULVS.D

V2,V1,F0 ; vector-scalar multiply LV V3,Ry ; load vector Y ADDVV V4,V2,V3 ; add SV Ry,V4 ; store the result Requires 6 instructions vs. almost 600 for MIPS Copyright 2012, Elsevier Inc. All rights reserved. 7 Instructions (p. 266) ADDVV.D V1,V2,V3 ADDVS.D V1,V2,F0 SUBVV.D V1,V2,V3 SUBVS.D V1,V2,F0 SUBSV.D V1,F0,V3 MULVV.D V1,V2,V3 DIVVV.D V1,V2,V3 LV V1,R1 SV R1,V1 LVWS V1,(R1,R2) Load w/ stride (R1+i*R2) S--VV.D

V1,V2 Compare elements (EQ, NE,GT) S--VS.D V1,F0 and set vector mask register (VM) Copyright 2012, Elsevier Inc. All rights reserved. 8 Execution time depends on three factors: Length of operand vectors Structural hazards Data dependencies Vector Architectures Vector Execution Time VMIPS functional units consume one element per clock cycle

Execution time is approximately the vector length Copyright 2012, Elsevier Inc. All rights reserved. 9 Start up time Latency of vector functional unit Assume the same as Cray-1 Vector Architectures Challenges Floating-point add => 6 clock cycles Floating-point multiply => 7 clock cycles

Floating-point divide => 20 clock cycles Vector load => 12 clock cycles Improvements: > 1 element per clock cycle Non-64 wide vectors IF statements in vector code Memory system optimizations to support vector processors Multiple dimensional matrices Sparse matrices Programming a vector computer Copyright 2012, Elsevier Inc. All rights reserved. 10 Element n of vector register A is hardwired to element n of vector register B

Allows for multiple hardware lanes Copyright 2012, Elsevier Inc. All rights reserved. Vector Architectures Multiple Lanes 11 Vector length not known at compile time? Use Vector Length Register (VLR) Use strip mining for vectors over the maximum length: Vector Architectures Vector Length Register low = 0; VL = (n % MVL); /*find odd-size piece using modulo op % */ for (j = 0; j <= (n/MVL); j=j+1) { /*outer loop*/ for (i = low; i < (low+VL); i=i+1) /*runs for length VL*/ Y[i] = a * X[i] + Y[i] ; /*main operation*/ low = low + VL; /*start of next vector*/

VL = MVL; /*reset the length to maximum vector length*/ } Copyright 2012, Elsevier Inc. All rights reserved. 12 Consider: for (i = 0; i < 64; i=i+1) if (X[i] != 0) X[i] = X[i] Y[i]; Use vector mask register to disable elements: Vector Architectures Vector Mask Registers LV V1,Rx ;load vector X into V1 LV V2,Ry ;load vector Y L.D F0,#0

;load FP zero into F0 SNEVS.D V1,F0 ;sets VM(i) to 1 if V1(i)!=F0 SUBVV.D V1,V1,V2 ;subtract under vector mask SV Rx,V1 ;store the result in X GFLOPS rate decreases! Copyright 2012, Elsevier Inc. All rights reserved. 13 Memory system must be designed to support high bandwidth for vector loads and stores Spread accesses across multiple banks

Vector Architectures Memory Banks Control bank addresses independently Load or store non sequential words Support multiple vector processors sharing the same memory Example: 32 processors, each generating 4 loads and 2 stores/cycle Processor cycle time is 2.167 ns, SRAM cycle time is 15 ns How many memory banks needed? Copyright 2012, Elsevier Inc. All rights reserved. 14 Consider: for (i = 0; i < 100; i=i+1) for (j = 0; j < 100; j=j+1) { A[i][j] = 0.0; for (k = 0; k < 100; k=k+1)

A[i][j] = A[i][j] + B[i][k] * D[k][j]; } Must vectorize multiplication of rows of B with columns of D Use non-unit stride Vector Architectures Stride: handling multidimensional arrays Loading: load non-consecutive memory locations in consecutive vector register locations Saving Complicates the memory system Copyright 2012, Elsevier Inc. All rights reserved. 15

In sparse matrices, the elements of the vector are stored in some compacted form and accessed indirectly. Consider: for (i = 0; i < n; i=i+1) A[K[i]] = A[K[i]] + C[M[i]]; Vector Architectures Scatter-Gather: handling sparse matrices Use index vector which indicates the non-zero elements of the vector, and allows to load them LV Vk, Rk ;load K LVI Va, (Ra+Vk) ;load A[K[]] LV Vm, Rm ;load M LVI Vc, (Rc+Vm) ;load C[M[]]

ADDVV.D Va, Va, Vc ;add them SVI (Ra+Vk), Va ;store A[K[]] Copyright 2012, Elsevier Inc. All rights reserved. 16 SIMD EXTENSIONS Copyright 2012, Elsevier Inc. All rights reserved. 17 Media applications operate on data types narrower than the native word size Example: disconnect carry chains to partition adder Limitations, compared to vector instructions: Number of data operands encoded into op code No sophisticated addressing modes (strided, scattergather) No mask registers

Copyright 2012, Elsevier Inc. All rights reserved. SIMD Instruction Set Extensions for Multimedia SIMD Extensions 18 Implementations: Intel MMX (1996) Streaming SIMD Extensions (SSE) (1999) Eight 16-bit integer ops Four 32-bit integer/fp ops or two 64-bit integer/fp ops Advanced Vector Extensions (2010)

Eight 8-bit integer ops or four 16-bit integer ops Four 64-bit integer/fp ops SIMD Instruction Set Extensions for Multimedia SIMD Implementations Operands must be consecutive and aligned memory locations Copyright 2012, Elsevier Inc. All rights reserved. 19 SIMD extensions today (Fall 2017) Latest iteration of SIMD extensions AVX-512 Available in Intel i9 series

Intel AVX-512 features include 32 vector registers each 512 bits wide, eight dedicated mask registers, 512-bit operations on packed floating point data or packed integer data, embedded rounding controls (override global settings), embedded broadcast, embedded floating-point fault suppression, embedded memory fault suppression, new operations, additional gather/scatter support, high speed math instructions, compact representation of large displacement value, and the ability to have optional capabilities beyond the foundational capabilities. It is interesting to note that the 32 ZMM registers represent 2K of register space! Copyright 2012, Elsevier Inc. All rights reserved. 20

Recently Viewed Presentations

  • Research on Multi-Agent Systems with Applications to Simulation

    Research on Multi-Agent Systems with Applications to Simulation

    Research on Multi-Agent Systems with Applications to Simulation and Training Thomas R. Ioerger Associate Professor Department of Computer Science
  • College Fit &amp; the Admission Process

    College Fit & the Admission Process

    Put the work back on the student - examples reflection questions, items to research, potential majors or academic programs, add college to list in Naviance, have a conversation with parent or guardian about college; steps to study for the SAT/ACT...
  • Ceci est un exemple de présentation pour votre oral de ...

    Ceci est un exemple de présentation pour votre oral de ...

    Ceci est un exemple de présentation pour votre oral de présentation étude et projet. A ne pas copier / coller pour votre propre présentation !!!
  • What is the RowanCard? The RowanCard is your

    What is the RowanCard? The RowanCard is your

    Know what your ISO card number is. The first 12 digits are printed on the back of your card. The final 4 can be found at id.rowan.edu. Each time you get a new student ID, your card number will change....
  • Lipid Based Oral Drug Delivery System (LBODDS)

    Lipid Based Oral Drug Delivery System (LBODDS)

    Formation of chylomicron and storage in golgi apparatus. Exocytosis into extracellular phase. Size-selective Process. Lipophilic drugs with logP > 5 with TG solubility > 50 mg/ml enters lymphatic delivery. Chylomicrons are big in size and access lymphatic transport. FFA <...
  • &#x27;Half-Caste&#x27; by John Agard

    'Half-Caste' by John Agard

    'Half-Caste' by John Agard Starter What does multi-cultural mean? What makes us a multi-cultural society? Background Information John Agard was born in Guyana, in the Caribbean in 1949, to parents of mixed nationality.
  • Enlightenment and Revolution in England and America (1550-1789)

    Enlightenment and Revolution in England and America (1550-1789)

    Sophia's son George, the first of the Hanoverian dynasty, became King George I of Great Britain. Queen Anne, who ruled from 1702 to 1714, had seventeen children. None survived her. Sophia of Hanover, granddaughter of James I, would have been...
  • Romeo &amp; Juliet SLICED in QUOTES - WordPress.com

    Romeo & Juliet SLICED in QUOTES - WordPress.com

    Quick Slice of Learning: Macbeth & Metre. Trochaic tetrameter is a rapid meter of poetry consisting of four feet of trochees.A trochee is made up of followed by one stressed syllable and one unstressed syllable: DAdum / DAdum / DAdum...