RETROSPECTIVE:
Improving Register Allocation for Subscripted Variables

David Callahan
Cray, Inc.
411 First Avenue S., Suite 600
Seattle, WA 98104-2860
david@cray.com

Steve Carr
Department of Computer Science
Michigan Technological University
Houghton, MI 49931-1295
carr@mtu.edu

Ken Kennedy
Department of Computer Science
Rice University
Houston, TX 77005-1892
ken@rice.edu

20 Years of the ACM/SIGPLAN Conference on Programming Language Design and Implementation (1979-1999): A Selection, 2003. Copyright 2003 ACM 1-58113-623-4.

1. INTRODUCTION
By the late 1980s, memory system performance and CPU performance had already begun to diverge. This trend made effective use of the register file imperative for excellent performance. Although most compilers at that time allocated scalar variables to registers using graph coloring with marked success [12, 13, 14, 6], allocation of array values to registers only occurred in rare circumstances because standard data-flow analysis techniques could not uncover the available reuse of array memory locations. This deficiency was especially problematic for scientific codes, since a majority of the computation involves array references.

Our original paper addressed this problem by presenting an algorithm and an experiment for a loop transformation, called scalar replacement, that exposed the reuse available in array references in an innermost loop. It also demonstrated experimentally how another loop transformation, called unroll-and-jam [2], could expose more opportunities for scalar replacement by moving reuse occurring across an outer loop into the innermost loop. The key contribution of this work was to demonstrate that the benefits of the very successful register allocation strategies developed for RISC processors could be extended to subscripted variables through the use of array dependence analysis. This approach and its descendants have led to substantive, and in some cases dramatic, improvements in the performance of scientific programs on machines with long memory latencies.

In the remainder of this retrospective, we review the major influences that resulted in the development of scalar replacement and unroll-and-jam, and the influence that our paper had on later work, including commercial compiler implementations.

2. BACKGROUND
In 1987, John Cocke, on behalf of IBM, approached Rice to carry out a research project on register allocation focused on a new RISC machine that would eventually become the RS-6000. He knew that we had a source-to-source program transformation system that had a strong data dependence analyzer built in. He believed that this would be needed to implement unroll-and-jam and similar loop restructuring transformations to improve memory hierarchy performance on the new system. Such strategies were critical because the RS-6000 would have memory latencies as large as 25 cycles, which was staggering in those days.

As a part of this effort, we developed the concept of scalar replacement, mostly as a way of achieving register allocation at the source-to-source level. Most compilers at that time, and certainly the RS-6000 compiler, did a very good job of allocating scalars to registers. Our thinking was that by copying array elements into scalars and then operating on the scalar quantities, we would encourage the compiler to keep the array elements in registers between uses. Dependence analysis was the key to identifying the uses that were candidates for this transformation.

Dependence analysis was developed beginning in the early 1970s to support automatic parallelization and vectorization. The earliest papers referring to the underlying ideas were by Kuck, Muraoka and Chen [19] and Lamport [20]. Kuck et al. [18, 17, 22, 27] and Allen et al. [3, 4] formalized dependence and applied it to parallelization and vectorization.

In parallelization and vectorization, dependences restrict the parallelism that can be extracted from a loop. Hence, many loop transformations, such as loop interchange [27, 3, 4], loop skewing [28], node splitting [18] and loop distribution [3, 4], attempt to modify or rearrange dependences so that parallelism can be extracted from a loop nest. In essence, the compiler restructures a loop nest so that some loop within the nest has no dependence between successive iterations, and hence can be parallelized.

Unlike vectorization and parallelization, scalar replacement and unroll-and-jam do not rearrange dependences to allow loop parallelism. Instead, these transformations use dependences to identify data reuse. If a data element can be kept in a register between the instruction at the source of a dependence and the instruction at its sink, a memory access can be avoided. In addition, dependence can be used to improve instruction-level parallelism through the unroll-and-jam transformation. Hence, in this paradigm, dependences represent opportunities rather than constraints.

Perhaps the first to take advantage of the reuse that dependences encapsulate was Abu-Sufah [1], who used dependence to improve the performance of virtual memory. Vector register allocation [4, 5] used dependence to determine when vectors of data were reused and could be kept in vector registers. In this context, scalar replacement is vector register allocation for vectors of size one.

As we indicated earlier, unroll-and-jam was not new; Allen and Cocke [2] defined it in their famous catalog. However, scalar replacement represented a new strategy, albeit one that was frequently employed in hand coding. Our first concept paper, by Callahan, Cocke and Kennedy [7], presented scalar replacement and unroll-and-jam as strategies to reduce pipeline interlock and improve the balance between memory accesses and floating-point computation. The paper in this volume reported on the implementation and experimental validation of these techniques.
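To make the two transformations concrete, the fragment below is a minimal sketch in C. It is illustrative only: the original experiments were on Fortran loops, and the kernel, function names, and unroll factor of two here are assumptions of ours rather than code from the paper. Scalar replacement copies a subscripted value into a scalar so that a conventional coloring allocator can keep it in a register across the inner loop; unroll-and-jam then unrolls the outer loop and fuses ("jams") the resulting inner loops so that a value loaded once is reused by several outer-loop iterations.

    /* Hypothetical kernel: accumulate every b[j] into each a[i]. */
    void sum_into(double *a, const double *b, int n) {
        /* Original loop nest: a[i] is loaded and stored on every j iteration,
           even though it does not change inside the j loop. */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i] = a[i] + b[j];
    }

    /* After scalar replacement: the inner-loop reuse of a[i] is carried by
       the scalar t, which the register allocator can keep in a register. */
    void sum_into_scalar_replaced(double *a, const double *b, int n) {
        for (int i = 0; i < n; i++) {
            double t = a[i];
            for (int j = 0; j < n; j++)
                t = t + b[j];
            a[i] = t;
        }
    }

    /* After unroll-and-jam of the i loop by 2 (assuming n is even for
       brevity), followed by scalar replacement: one load of b[j] now feeds
       two outer-loop iterations, halving the memory traffic for b. */
    void sum_into_unroll_and_jam(double *a, const double *b, int n) {
        for (int i = 0; i < n; i += 2) {
            double t0 = a[i], t1 = a[i + 1];
            for (int j = 0; j < n; j++) {
                double bj = b[j];
                t0 = t0 + bj;
                t1 = t1 + bj;
            }
            a[i] = t0;
            a[i + 1] = t1;
        }
    }

All three versions compute the same result; the point is that the dependence from each load of a[i] or b[j] to its later uses is exactly what licenses keeping the value in a scalar.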
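The "balance" being improved can be stated roughly as follows. This is a paraphrase from memory of the notion introduced in [7] and used as the optimization objective in [10], with symbols of our own choosing:

    \beta_L = \frac{\text{memory references per loop iteration}}{\text{floating-point operations per iteration}},
    \qquad
    \beta_M = \frac{\text{peak memory words per cycle}}{\text{peak flops per cycle}},
    \qquad
    \beta_L > \beta_M \;\Rightarrow\; \text{the loop is memory bound.}

Unroll-and-jam combined with scalar replacement removes loads per floating-point operation, driving \beta_L down toward \beta_M; choosing unroll amounts with this objective is the loop-balance work discussed in the next section.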
3. INFLUENCE
Since our paper appeared in SIGPLAN PLDI '90, many extensions have been added to both scalar replacement and unroll-and-jam. In this section, we outline a number of the papers most closely related to and influenced by our work. This list is only a sample and is by no means exhaustive.

Carr and Kennedy [11] extended the original algorithm to apply scalar replacement to loops that contain conditional control statements by using a combination of dependence analysis and partial redundancy elimination [15]. Duesterwald, Gupta and Soffa [16] developed a data-flow analysis framework to apply scalar replacement in loops with control statements.

Wolf and Lam [25] used unroll-and-jam in the context of their data locality optimizations; they referred to this as "register tiling." Carr and Kennedy [10] showed how to compute the unroll factors for unroll-and-jam from the dependence graph with the objective of improving loop balance. Carr [8], Wolf, Maydan and Chen [26], and Carr and Guan [9] combined optimization for instruction-level parallelism and cache using unroll-and-jam and scalar replacement. Sarkar [24] used a model based on local instruction scheduling to compute unroll-and-jam amounts. Qian, Carr and Sweany [23] developed a performance model based upon software pipelining that included intercluster register copies for clustered VLIW architectures.

Since reuse of array values in registers can be captured by dependence information, it is only natural to expand the use of dependence to improve the performance of the data cache. McKinley, Carr and Tseng [21] use dependence analysis directly to determine the cache behavior of loops and apply loop permutation and loop fusion to improve locality.

Many commercial compilers today incorporate both scalar replacement and unroll-and-jam in some form. We are aware of implementations of these optimizations in compilers for the following architectures: Texas Instruments TMS320C6x, Compaq Alpha, MIPS R10000, and Intel IA-64.

ACKNOWLEDGMENTS
The authors would like to express their sincerest gratitude to the late John Cocke, who inspired and funded this work while at IBM. His accomplishments made him a pioneer not only of computer architecture but also of compiler optimization.

REFERENCES
[1] W. Abu-Sufah. Improving the Performance of Virtual Memory Computers. PhD thesis, University of Illinois, 1978.
[2] F. E. Allen and J. Cocke. A catalogue of optimizing transformations. In Design and Optimization of Compilers, pages 1–30. Prentice-Hall, 1972.
[3] J. Allen, D. Callahan, and K. Kennedy. Automatic decomposition of scientific programs for parallel execution. In Conference Record of the Fourteenth ACM Symposium on the Principles of Programming Languages, Munich, West Germany, Jan. 1987.
[4] J. Allen and K. Kennedy. Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems, 9(4):491–542, Oct. 1987.
[5] J. Allen and K. Kennedy. Vector register allocation. IEEE Transactions on Computers, 41(10):1290–1317, Oct. 1992.
[6] P. Briggs, K. D. Cooper, K. Kennedy, and L. Torczon. Coloring heuristics for register allocation. In Proceedings of the ACM SIGPLAN '89 Conference on Programming Language Design and Implementation, pages 275–284, Portland, OR, July 1989.
[7] D. Callahan, J. Cocke, and K. Kennedy. Estimating interlock and improving balance for pipelined machines. Journal of Parallel and Distributed Computing, 5:334–358, 1988.
[8] S. Carr. Combining optimization for cache and instruction-level parallelism. In Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques, pages 238–247, Boston, MA, Oct. 1996.
[9] S. Carr and Y. Guan. Unroll-and-jam using uniformly generated sets. In Proceedings of the 30th International Symposium on Microarchitecture (MICRO-30), Research Triangle Park, NC, Dec. 1997.
[10] S. Carr and K. Kennedy. Improving the ratio of memory operations to floating-point operations in loops. ACM Transactions on Programming Languages and Systems, 16(6):1768–1810, 1994.
[11] S. Carr and K. Kennedy. Scalar replacement in the presence of conditional control flow. Software – Practice & Experience, 24(1):51–77, Jan. 1994.
[12] G. Chaitin, M. Auslander, A. Chandra, J. Cocke, M. Hopkins, and P. Markstein. Register allocation via coloring. Computer Languages, 6:45–57, Jan. 1981.
[13] G. J. Chaitin. Register allocation and spilling via graph coloring. In Proceedings of the ACM SIGPLAN '82 Symposium on Compiler Construction, pages 98–105, Boston, MA, June 1982.
[14] F. C. Chow and J. L. Hennessy. Register allocation by priority-based coloring. In Proceedings of the ACM SIGPLAN '84 Symposium on Compiler Construction, pages 222–232, Montreal, Quebec, June 1984.
[15] K.-H. Drechsler and M. P. Stadel. A solution to a problem with Morel and Renvoise's "Global optimization by suppression of partial redundancies". ACM Transactions on Programming Languages and Systems, 10(4):635–640, Oct. 1988.
[16] E. Duesterwald, R. Gupta, and M. L. Soffa. A practical data flow framework for array reference analysis and its use in optimizations. In Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 68–77, Albuquerque, NM, June 1993.
[17] D. Kuck, R. Kuhn, B. Leasure, and M. Wolfe. The structure of an advanced retargetable vectorizer. In Supercomputers: Design and Applications, pages 163–178. IEEE Computer Society Press, Silver Spring, MD, 1984.
[18] D. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. Wolfe. Dependence graphs and compiler optimizations. In Conference Record of the Eighth ACM Symposium on the Principles of Programming Languages, 1981.
[19] D. Kuck, Y. Muraoka, and S. Chen. On the number of operations simultaneously executable in Fortran-like programs and their resulting speedup. IEEE Transactions on Computers, C-21(12):1293–1310, Dec. 1972.
[20] L. Lamport. The parallel execution of DO-loops. Communications of the ACM, 17(2):83–93, 1974.
[21] K. S. McKinley, S. Carr, and C.-W. Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems, 18(4):424–453, 1996.
[22] D. A. Padua and M. J. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 29(12):1184–1201, Dec. 1986.
[23] Y. Qian, S. Carr, and P. Sweany. Optimizing loop performance for clustered VLIW architectures. In Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, pages 271–280, Charlottesville, VA, Sept. 2002.
[24] V. Sarkar. Optimized unrolling of nested loops. In Proceedings of the 2000 International Conference on Supercomputing, pages 153–166, Santa Fe, NM, May 2000.
[25] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, pages 30–44, Toronto, Ontario, June 1991.
[26] M. E. Wolf, D. E. Maydan, and D.-K. Chen. Combining loop transformations considering caches and scheduling. In Twenty-Ninth Annual Symposium on Microarchitecture (MICRO-29), Dec. 1996.
[27] M. Wolfe. Advanced loop interchange. In Proceedings of the 1986 International Conference on Parallel Processing, Aug. 1986.
[28] M. Wolfe. Loop skewing: The wavefront method revisited. Journal of Parallel Programming, 15(4):279–293, Aug. 1986.
