Application Configurable Processors

Facilitating Compiler Optimizations Through the Dynamic Mapping of Alternate Register Structures

CASES 2007
Florida State University
Chris Zimmer, Steve Hines, Prasad Kulkarni, Gary Tyson, David Whalley

Motivation

Embedded processors have fewer registers.
Compiler optimizations increase register pressure.
It is difficult to apply aggressive compiler optimizations on embedded systems.

Vector Multiply Example

Even before aggressive optimizations, 60% of the available registers are already used.
Further optimizations like loop unrolling and software pipelining are inhibited.

    int A[1000], B[1000];
    void vmul() {
        int I;
        for (I = 2; I < 1000; I++)
            B[I] = A[I] * B[I-2];
    }

    .L3:
        ldr  r1, [r2, r3, lsl #2]
        ldr  r12, [r4], #4
        mul  r0, r12, r1
        str  r0, [r5, r3, lsl #2]
        add  r3, r3, #1
        cmp  r3, #1000
        blt  .L3

Application Configurable Processors

Exploit common reference patterns found in code.
Small register files mimic these reference behaviors.
A map table provides register redirection.
The architecture is changed to add more registers with minimal impact on ISA support; in particular, the operand size does not increase.

Architectural Modifications

[Figure: register specifiers R0 through R15 pass through the Map Table, which directs each access either to the Register File or to a customized structure (Queue Q1, Queue Q2, Queue Q3, Stack Q4, Circular Buffer Q5). In the example shown, R1 is mapped to Q1, while R0, R6, and R15 map to themselves.]
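
The redirection performed by the map table can be sketched as a small software model in C. This is only an illustration under stated assumptions, not the authors' hardware design: the type and function names (`Cpu`, `RegQueue`, `cpu_map`, `cpu_read`, `cpu_write`), the queue depth, and the choice that a read from a queue removes the oldest value are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sizes, chosen only for illustration. */
enum { NUM_REGS = 16, QUEUE_DEPTH = 4, NUM_QUEUES = 3, UNMAPPED = -1 };

typedef struct {
    int32_t slot[QUEUE_DEPTH];
    int head;   /* index of the oldest value */
    int count;  /* number of values currently held */
} RegQueue;

typedef struct {
    int32_t regfile[NUM_REGS];  /* architected register file */
    int map[NUM_REGS];          /* map table: UNMAPPED or a queue index */
    RegQueue queue[NUM_QUEUES]; /* customized register structures */
} Cpu;

void cpu_init(Cpu *cpu) {
    for (int r = 0; r < NUM_REGS; r++) {
        cpu->regfile[r] = 0;
        cpu->map[r] = UNMAPPED;
    }
    for (int q = 0; q < NUM_QUEUES; q++) {
        cpu->queue[q].head = 0;
        cpu->queue[q].count = 0;
    }
}

/* Install or remove a mapping from register specifier r to queue q. */
void cpu_map(Cpu *cpu, int r, int q) {
    assert(r >= 0 && r < NUM_REGS && q >= 0 && q < NUM_QUEUES);
    cpu->map[r] = q;
}

void cpu_unmap(Cpu *cpu, int r) {
    assert(r >= 0 && r < NUM_REGS);
    cpu->map[r] = UNMAPPED;
}

/* Every write consults the map table first. */
void cpu_write(Cpu *cpu, int r, int32_t value) {
    int q = cpu->map[r];
    if (q == UNMAPPED) {
        cpu->regfile[r] = value;
        return;
    }
    RegQueue *rq = &cpu->queue[q];
    assert(rq->count < QUEUE_DEPTH);  /* assumption: overflow is an error */
    rq->slot[(rq->head + rq->count) % QUEUE_DEPTH] = value;
    rq->count++;
}

/* Every read consults the map table first. */
int32_t cpu_read(Cpu *cpu, int r) {
    int q = cpu->map[r];
    if (q == UNMAPPED)
        return cpu->regfile[r];
    RegQueue *rq = &cpu->queue[q];
    assert(rq->count > 0);            /* assumption: underflow is an error */
    int32_t value = rq->slot[rq->head];
    rq->head = (rq->head + 1) % QUEUE_DEPTH;
    rq->count--;                      /* assumption: reads dequeue */
    return value;
}
```

In this model, an unmapped specifier behaves like an ordinary register, while a mapped specifier returns values in the order they were written. The instruction encoding is untouched: the same register number is used in both cases.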

Software Pipelining

Software pipelining is not often found in embedded compilers.
Software pipelining reduces the overall cycle time of a loop.
It extracts iterations.
It consumes stalls.
It consumes registers!

Software Pipelining Example

    int A[1000], B[1000], C[1000];
    void vmul() {
        int I;
        for (I = 2; I < 1000; I++)
            B[I] = A[I] * C[I];
    }

    .L3:
        ldr  r1, [r2, r3, lsl #2]
        ldr  r12, [r4], #4
        mul  r0, r12, r1
        str  r0, [r5, r3, lsl #2]
        add  r3, r3, #1
        cmp  r3, #1000
        blt  .L3

Stalls are present when the loop is run.
[Listing: the same loop annotated with the stall cycles that occur between its instructions.]

Instruction

Goal: Minimal modification to the existing instruction set. Single-cycle instruction latency.
Method: Add a single instruction to the ISA that is used to map and unmap a common register specifier into a customized register structure.

    qmap r3, #4, q3

Architectural Modifications

[Figure: Map Table between the register specifiers (R0, R1, R6, R15) and the Register File, with R1 mapped to Q1. Customized structures: Queue Q1, Queue Q2, Queue Q3, Destructive Queue Q4, Circular Buffer Q5.]

An access to R0, which has no mapping in the table, would get its data from the register file.
R1 is mapped into Q1 and would retrieve its data from there.

Software Pipelining Example

    int A[1000], B[1000], C[1000];
    void vmul() {
        int I;
        for (I = 2; I < 1000; I++)
            B[I] = A[I] * C[I];
    }

[Figure: contents of queues Q1, Q2, and Q3 while the pipelined loop runs.]
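
The effect of software pipelining on this loop can be sketched at the source level. The version below is a hand-written illustration, not the output of the authors' compiler: the function names, the one-iteration overlap, and the assumption that `B` does not alias `A` or `C` are all hypothetical. It shows why the transformation consumes registers: values loaded for the next iteration (`a_next`, `c_next`) must stay live while the current iteration is still multiplying and storing.

```c
#include <assert.h>

enum { N = 1000, START = 2 };

/* Reference loop, in the form shown on the slide. */
void vmul_plain(const int *A, int *B, const int *C) {
    assert(A && B && C);
    for (int i = START; i < N; i++)
        B[i] = A[i] * C[i];
}

/* Software-pipelined sketch: the loads for iteration i+1 are issued
 * while iteration i multiplies and stores, so load latency can overlap
 * with useful work. Assumes B does not overlap A or C. */
void vmul_pipelined(const int *A, int *B, const int *C) {
    assert(A && B && C);

    /* Prolog: start the first iteration's loads. */
    int a = A[START];
    int c = C[START];

    /* Kernel: finish iteration i, start iteration i+1. */
    for (int i = START; i < N - 1; i++) {
        int a_next = A[i + 1];  /* extra live value */
        int c_next = C[i + 1];  /* extra live value */
        B[i] = a * c;
        a = a_next;
        c = c_next;
    }

    /* Epilog: finish the last iteration. */
    B[N - 1] = a * c;
}
```

With longer latencies, more iterations would have to be in flight at once, and each in-flight value needs its own storage. That is the role the slides assign to the register queues.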

Register Usage

Register Savings Using Register Structures, Loads 8x4

    Benchmark                    AR in Original Loop   AR Needed to Pipeline   AR Contained in Customized Structures
    N Real Updates               10                    10                      6
    Dot Product                  9                     9                       4
    Matrix Multiply              9                     9                       4
    Fir                          6                     6                       4
    Mac                          10                    8                       10
    Fir2Dim (3 similar loops)    10                    10                      4

Register Savings Using Register Structures, Loads 16x4

    Benchmark                    AR in Original Loop   AR Needed to Pipeline   AR Contained in Customized Structures
    N Real Updates               10                    10                      6
    Dot Product                  9                     9                       4
    Matrix Multiply              9                     9                       4
    Fir                          6                     6                       4
    Mac                          10                    8                       12
    Fir2Dim                      10                    10                      4

Register Savings Using Register Structures, Loads 32x4

    Benchmark                    AR in Original Loop   AR Needed to Pipeline   AR Contained in Customized Structures
    N Real Updates               10                    10                      9
    Dot Product                  9                     9                       8
    Matrix Multiply              9                     9                       8
    Fir                          6                     6                       12
    Mac                          10                    8                       18
    Fir2Dim                      10                    10                      8

Results

Multiply latency varied, load latency set at four.
[Figure: percent cycle reduction (in-order issue) versus multiply latency (2, 4, 8, 16, 32) for Dot Product, Matrix, Fir, N Real Updates, Conv45, Mac, and Fir2Dim.]

Results

Load latency varied, multiply latency set at four.
[Figure: percent cycle reduction (in-order issue) versus load latency (2, 4, 8, 16, 32) for Dot Product, Matrix, Fir, N Real Updates, Conv45, Mac, and Fir2Dim.]

Conclusions

Customized register structures reduce register pressure.
Software pipelining is viable in resource-constrained environments.
Performance can be improved with minor impact to the ISA.

Extras

Reference Behaviors

Stack reference behavior:

    ldr  r1, [r6, r4, lsl #4]
    ldr  r12, [r6, r4, lsl #8]
    ldr  r8, [r6, r4, lsl #12]
    str  r8, [r3, r4, lsl #16]
    str  r12, [r3, r4, lsl #20]
    str  r1, [r3, r4, lsl #24]

Application Configurable Architecture

Application configurable processors are designed using a mapping table similar to the register rename table found in many out-of-order implementations.
The map table is read during every access to the architected register file.
This determines whether a register specifier refers to the original architected register file or to a customized register structure.

Application Configurable Architecture

The customized register files are small, but they efficiently manage values that would otherwise require many architected registers.
The customized register files can mimic queues, stacks, and circular buffers.
These structures are accessed using the same register specifier that is used to access the architected register file.

Remove Reference Behaviors

Original code, showing stack reference behavior across r1, r12, and r8:

    ldr  r1, [r6, r4, lsl #4]
    ldr  r12, [r6, r4, lsl #8]
    ldr  r8, [r6, r4, lsl #12]
    str  r8, [r3, r4, lsl #16]
    str  r12, [r3, r4, lsl #20]
    str  r1, [r3, r4, lsl #24]

With r1 mapped to a stack, every access uses r1:

    ldr  r1, [r6, r4, lsl #4]
    ldr  r1, [r6, r4, lsl #8]
    ldr  r1, [r6, r4, lsl #12]
    str  r1, [r3, r4, lsl #16]
    str  r1, [r3, r4, lsl #20]
    str  r1, [r3, r4, lsl #24]

[Figure: a qmap instruction maps r1 to structure q0, which holds the values previously kept in R8, R12, and R1.]

This frees up r8 and r12 for other use.

Modulo Scheduling

For our work we used modulo scheduling.
This uses the dependences and latencies of the loop instructions to generate a modulo-scheduled loop.
The prolog and epilog are then built from this schedule.
The prolog and epilog require register renaming of loop-carried dependences to ensure a correct loop.
Renaming in embedded processors is often not possible.

Register Renaming Due to Software Pipelining

Renaming does not work because there are not enough registers.
Rotating registers would require a significant rewrite of the embedded ISA.
The loop-carried values can simply be mapped into a register queue that holds each value across several iterations.

Results: Register Savings

As instruction latency grows, more iterations of the loop are extracted to spread out the latency.
The extra registers that would be required to perform renaming have been measured at 25% to 200% of the available registers in the ARM.
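
The idea of holding a loop-carried value in a register queue can be sketched in C using the `B[I] = A[I] * B[I-2]` loop from the vector multiply example. This is a software analogy under stated assumptions, not the authors' implementation: the function names, the two-entry array standing in for a hardware queue, and the constants `LEN` and `DIST` are hypothetical. The queue keeps the last `DIST` results, so the value produced two iterations earlier is available without a separate named register per iteration.

```c
#include <assert.h>

enum { LEN = 1000, DIST = 2 };  /* DIST: distance of the loop-carried dependence */

/* Reference loop: B[i] depends on the result from DIST iterations ago. */
void vmul_carried_plain(const int *A, int *B) {
    assert(A && B);
    for (int i = DIST; i < LEN; i++)
        B[i] = A[i] * B[i - DIST];
}

/* Same computation, with the carried values held in a DIST-deep FIFO
 * instead of being re-read from B. */
void vmul_carried_queue(const int *A, int *B) {
    assert(A && B);
    int queue[DIST];
    int head = 0;  /* position of the oldest carried value */

    /* Prime the queue with the initial values B[0] .. B[DIST-1]. */
    for (int k = 0; k < DIST; k++)
        queue[k] = B[k];

    for (int i = DIST; i < LEN; i++) {
        int oldest = queue[head];   /* value produced DIST iterations ago */
        int result = A[i] * oldest;
        B[i] = result;
        queue[head] = result;       /* new value takes the freed slot */
        head = (head + 1) % DIST;
    }
}
```

Every iteration names the queue the same way, which mirrors how a single mapped register specifier could stand in for several renamed registers.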
