RETROSPECTIVE:
Improving Register Allocation for Subscripted Variables

David Callahan
Cray, Inc.
411 First Avenue S., Suite 600
Seattle, WA 98104-2860
david@cray.com

Steve Carr
Department of Computer Science
Michigan Technological University
Houghton, MI 49931-1295
carr@mtu.edu

Ken Kennedy
Department of Computer Science
Rice University
Houston, TX 77005-1892
ken@rice.edu

20 Years of the ACM/SIGPLAN Conference on Programming Language Design and Implementation (1979-1999): A Selection, 2003. Copyright 2003 ACM 1-58113-623-4.

1. INTRODUCTION
By the late 1980s, memory system performance and CPU performance had already begun to diverge. This trend made effective use of the register file imperative for excellent performance. Although most compilers at that time allocated scalar variables to registers using graph coloring with marked success [12, 13, 14, 6], allocation of array values to registers only occurred in rare circumstances because standard data-flow analysis techniques could not uncover the available reuse of array memory locations. This deficiency was especially problematic for scientific codes, since a majority of the computation involves array references.

Our original paper addressed this problem by presenting an algorithm and an experiment for a loop transformation, called scalar replacement, that exposed the reuse available in array references in an innermost loop. It also demonstrated experimentally how another loop transformation, called unroll-and-jam [2], could expose more opportunities for scalar replacement by moving reuse occurring across an outer loop into the innermost loop. The key contribution of this work was to demonstrate that the benefits of the very successful register allocation strategies developed for RISC processors could be extended to subscripted variables through the use of array dependence analysis. This approach and its descendants have led to substantive, and in some cases dramatic, improvements in the performance of scientific programs on machines with long memory latencies.

In the remainder of this retrospective, we review the major influences that resulted in the development of scalar replacement and unroll-and-jam, and the influence that our paper had on later work, including commercial compiler implementations.

2. BACKGROUND
In 1987, John Cocke, on behalf of IBM, approached Rice to carry out a research project on register allocation focused on a new RISC machine that would eventually become the RS-6000. He knew that we had a source-to-source program transformation system that had a strong data dependence analyzer built in. He believed that this would be needed to implement unroll-and-jam and similar loop restructuring transformations to improve memory hierarchy performance on the new system. Such strategies were critical because the RS-6000 would have memory latencies as large as 25 cycles, which was staggering in those days.

As a part of this effort, we developed the concept of scalar replacement, mostly as a way of achieving register allocation at the source-to-source level. Most compilers at that time, and certainly the RS-6000 compiler, did a very good job of allocating scalars to registers. Our thinking was that by copying array elements into scalars and then operating on the scalar quantities, we would encourage the compiler to keep the array elements in registers between uses. Dependence analysis was the key to identifying the uses that were candidates for this transformation.

Dependence analysis was developed beginning in the early 1970s to support automatic parallelization and vectorization. The earliest papers referring to the underlying ideas were by Kuck, Muraoka and Chen [19] and Lamport [20]. Kuck et al. [18, 17, 22, 27] and Allen et al. [3, 4] formalized dependence and applied it to parallelization and vectorization.

In parallelization and vectorization, dependences restrict the parallelism that can be extracted from a loop. Hence, many loop transformations, such as loop interchange [27, 3, 4], loop skewing [28], node splitting [18] and loop distribution [3, 4], attempt to modify or rearrange dependences so that parallelism can be extracted from a loop nest. In essence, the compiler restructures a loop nest so that some loop within the nest has no dependence between successive iterations, and hence can be parallelized.

Unlike vectorization and parallelization, scalar replacement and unroll-and-jam do not rearrange dependences to allow loop parallelism. Instead, these transformations use dependences to identify data reuse. If a data element can be kept in a register between the instruction at the source of a dependence and the instruction at its sink, a memory access can be avoided. In addition, dependence can be used to improve instruction-level parallelism through the unroll-and-jam transformation. Hence, in this paradigm, dependences represent opportunities rather than constraints.

Perhaps the first to take advantage of the reuse that dependences encapsulate was Abu-Sufah [1], who used dependence to improve the performance of virtual memory. Vector register allocation [4, 5] used dependence to determine when vectors of data were reused and could be kept in vector registers. In this context, scalar replacement is vector register allocation for vectors of size one.

As we indicated earlier, unroll-and-jam was not new; Allen and Cocke [2] defined it in their famous catalog. However, scalar replacement represented a new strategy, albeit one that was frequently employed in hand coding. Our first concept paper, by Callahan, Cocke and Kennedy [7], presented scalar replacement and unroll-and-jam as strategies to reduce pipeline interlock and improve the balance between memory accesses and floating-point computation. The paper in this volume reported on the implementation and experimental validation of these techniques.
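To make the two transformations concrete, the fragment below is a minimal sketch in C. It is illustrative only: the original experiments were on Fortran loops, and the kernel, function names, and unroll factor of two here are assumptions of ours rather than code from the paper. Scalar replacement copies a subscripted value into a scalar so that a conventional coloring allocator can keep it in a register across the inner loop; unroll-and-jam then unrolls the outer loop and fuses ("jams") the resulting inner loops so that a value loaded once is reused by several outer-loop iterations.

    /* Hypothetical kernel: accumulate every b[j] into each a[i]. */
    void sum_into(double *a, const double *b, int n) {
        /* Original loop nest: a[i] is loaded and stored on every j iteration,
           even though it does not change inside the j loop. */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i] = a[i] + b[j];
    }

    /* After scalar replacement: the inner-loop reuse of a[i] is carried by
       the scalar t, which the register allocator can keep in a register. */
    void sum_into_scalar_replaced(double *a, const double *b, int n) {
        for (int i = 0; i < n; i++) {
            double t = a[i];
            for (int j = 0; j < n; j++)
                t = t + b[j];
            a[i] = t;
        }
    }

    /* After unroll-and-jam of the i loop by 2 (assuming n is even for
       brevity), followed by scalar replacement: one load of b[j] now feeds
       two outer-loop iterations, halving the memory traffic for b. */
    void sum_into_unroll_and_jam(double *a, const double *b, int n) {
        for (int i = 0; i < n; i += 2) {
            double t0 = a[i], t1 = a[i + 1];
            for (int j = 0; j < n; j++) {
                double bj = b[j];
                t0 = t0 + bj;
                t1 = t1 + bj;
            }
            a[i] = t0;
            a[i + 1] = t1;
        }
    }

All three versions compute the same result; the point is that the dependence from each load of a[i] or b[j] to its later uses is exactly what licenses keeping the value in a scalar.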
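The "balance" being improved can be stated roughly as follows. This is a paraphrase from memory of the notion introduced in [7] and used as the optimization objective in [10], with symbols of our own choosing:

    \beta_L = \frac{\text{memory references per loop iteration}}{\text{floating-point operations per iteration}},
    \qquad
    \beta_M = \frac{\text{peak memory words per cycle}}{\text{peak flops per cycle}},
    \qquad
    \beta_L > \beta_M \;\Rightarrow\; \text{the loop is memory bound.}

Unroll-and-jam combined with scalar replacement removes loads per floating-point operation, driving \beta_L down toward \beta_M; choosing unroll amounts with this objective is the loop-balance work discussed in the next section.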
3. INFLUENCE
Since our paper appeared in SIGPLAN PLDI '90, many extensions have been added to both scalar replacement and unroll-and-jam. In this section, we outline a number of the papers most closely related to and influenced by our work. This list is only a sample and is by no means exhaustive.

Carr and Kennedy [11] extended the original algorithm to apply scalar replacement to loops that contain conditional control statements by using a combination of dependence analysis and partial redundancy elimination [15]. Duesterwald, Gupta and Soffa [16] developed a data-flow analysis framework to apply scalar replacement in loops with control statements.

Wolf and Lam [25] used unroll-and-jam in the context of their data locality optimizations; they referred to this as "register tiling." Carr and Kennedy [10] showed how to compute the unroll factors for unroll-and-jam from the dependence graph with the objective of improving loop balance. Carr [8], Wolf, Maydan and Chen [26], and Carr and Guan [9] combined optimization for instruction-level parallelism and cache using unroll-and-jam and scalar replacement. Sarkar [24] used a model based on local instruction scheduling to compute unroll-and-jam amounts. Qian, Carr and Sweany [23] developed a performance model based upon software pipelining that included intercluster register copies for clustered VLIW architectures.

Since reuse of array values in registers can be captured by dependence information, it is only natural to expand the use of dependence to improve the performance of the data cache. McKinley, Carr and Tseng [21] use dependence analysis directly to determine the cache behavior of loops and apply loop permutation and loop fusion to improve locality.

Many commercial compilers today incorporate both scalar replacement and unroll-and-jam in some form. We are aware of implementations of these optimizations in compilers for the following architectures: Texas Instruments TMS320C6x, Compaq Alpha, MIPS R10000, and Intel IA-64.

ACKNOWLEDGMENTS
The authors would like to express their sincerest gratitude to the late John Cocke, who inspired and funded this work while at IBM. His accomplishments made him a pioneer not only of computer architecture but also of compiler optimization.

REFERENCES
[1] W. Abu-Sufah. Improving the Performance of Virtual Memory Computers. PhD thesis, University of Illinois, 1978.
[2] F. E. Allen and J. Cocke. A catalogue of optimizing transformations. In Design and Optimization of Compilers, pages 1–30. Prentice-Hall, 1972.
[3] J. Allen, D. Callahan, and K. Kennedy. Automatic decomposition of scientific programs for parallel execution. In Conference Record of the Fourteenth ACM Symposium on the Principles of Programming Languages, Munich, West Germany, Jan. 1987.
[4] J. Allen and K. Kennedy. Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems, 9(4):491–542, Oct. 1987.
[5] J. Allen and K. Kennedy. Vector register allocation. IEEE Transactions on Computers, 41(10):1290–1317, Oct. 1992.
[6] P. Briggs, K. D. Cooper, K. Kennedy, and L. Torczon. Coloring heuristics for register allocation. In Proceedings of the ACM SIGPLAN '89 Conference on Programming Language Design and Implementation, pages 275–284, Portland, OR, July 1989.
[7] D. Callahan, J. Cocke, and K. Kennedy. Estimating interlock and improving balance for pipelined machines. Journal of Parallel and Distributed Computing, 5:334–358, 1988.
[8] S. Carr. Combining optimization for cache and instruction-level parallelism. In Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques, pages 238–247, Boston, MA, Oct. 1996.
[9] S. Carr and Y. Guan. Unroll-and-jam using uniformly generated sets. In Proceedings of the 30th International Symposium on Microarchitecture (MICRO-30), Research Triangle Park, NC, Dec. 1997.
[10] S. Carr and K. Kennedy. Improving the ratio of memory operations to floating-point operations in loops. ACM Transactions on Programming Languages and Systems, 16(6):1768–1810, 1994.
[11] S. Carr and K. Kennedy. Scalar replacement in the presence of conditional control flow. Software – Practice & Experience, 24(1):51–77, Jan. 1994.
[12] G. Chaitin, M. Auslander, A. Chandra, J. Cocke, M. Hopkins, and P. Markstein. Register allocation via coloring. Computer Languages, 6:45–57, Jan. 1981.
[13] G. J. Chaitin. Register allocation and spilling via graph coloring. In Proceedings of the ACM SIGPLAN '82 Symposium on Compiler Construction, pages 98–105, Boston, MA, June 1982.
[14] F. C. Chow and J. L. Hennessy. Register allocation by priority-based coloring. In Proceedings of the ACM SIGPLAN '84 Symposium on Compiler Construction, pages 222–232, Montreal, Quebec, June 1984.
[15] K.-H. Drechsler and M. P. Stadel. A solution to a problem with Morel and Renvoise's "Global optimization by suppression of partial redundancies". ACM Transactions on Programming Languages and Systems, 10(4):635–640, Oct. 1988.
[16] E. Duesterwald, R. Gupta, and M. L. Soffa. A practical data flow framework for array reference analysis and its use in optimizations. In Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 68–77, Albuquerque, NM, June 1993.
[17] D. Kuck, R. Kuhn, B. Leasure, and M. Wolfe. The structure of an advanced retargetable vectorizer. In Supercomputers: Design and Applications, pages 163–178. IEEE Computer Society Press, Silver Spring, MD, 1984.
[18] D. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. Wolfe. Dependence graphs and compiler optimizations. In Conference Record of the Eighth ACM Symposium on the Principles of Programming Languages, 1981.
[19] D. Kuck, Y. Muraoka, and S. Chen. On the number of operations simultaneously executable in Fortran-like programs and their resulting speedup. IEEE Transactions on Computers, C-21(12):1293–1310, Dec. 1972.
[20] L. Lamport. The parallel execution of DO-loops. Communications of the ACM, 17(2):83–93, 1974.
[21] K. S. McKinley, S. Carr, and C.-W. Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems, 18(4):424–453, 1996.
[22] D. A. Padua and M. J. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 29(12):1184–1201, Dec. 1986.
[23] Y. Qian, S. Carr, and P. Sweany. Optimizing loop performance for clustered VLIW architectures. In Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, pages 271–280, Charlottesville, VA, Sept. 2002.
[24] V. Sarkar. Optimized unrolling of nested loops. In Proceedings of the 2000 International Conference on Supercomputing, pages 153–166, Santa Fe, NM, May 2000.
[25] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, pages 30–44, Toronto, Ontario, June 1991.
[26] M. E. Wolf, D. E. Maydan, and D.-K. Chen. Combining loop transformations considering caches and scheduling. In Twenty-Ninth Annual Symposium on Microarchitecture (MICRO-29), Dec. 1996.
[27] M. Wolfe. Advanced loop interchange. In Proceedings of the 1986 International Conference on Parallel Processing, Aug. 1986.
[28] M. Wolfe. Loop skewing: The wavefront method revisited. Journal of Parallel Programming, 15(4):279–293, Aug. 1986.
