CMU Computer Systems: Program Optimization

This article discusses the importance of code optimization. It covers generally useful optimizations such as precomputation, strength reduction, and sharing of common subexpressions, as well as optimization blockers such as procedure calls and memory aliasing. It stresses that tuning performance requires understanding how programs are compiled and executed and how modern processors work, introduces techniques for exploiting instruction-level parallelism and dealing with conditionals, and notes the limitations of optimizing compilers. It recommends optimizing at multiple levels and paying close attention to inner loops, and closes with strategies for exploiting superscalar processors and vector instructions.

Optimization

  • Overview
  • Generally Useful Optimizations
    • Code motion/precomputation
    • Strength reduction
    • Sharing of common subexpressions
    • Removing unnecessary procedure calls
  • Optimization Blockers
    • Procedure calls
    • Memory aliasing
  • Exploiting Instruction-Level Parallelism
  • Dealing with Conditionals
Performance Realities
  • There’s more to performance than asymptotic complexity
  • Constant factors matter too!
    • Easily see 10:1 performance range depending on how code is written
    • Must optimize at multiple levels:
      • algorithm, data representations, procedures, and loops
  • Must understand system to optimize performance
    • How programs are compiled and executed
    • How modern processors + memory systems operate
    • How to measure program performance and identify bottlenecks
    • How to improve performance without destroying code modularity and generality
Optimizing Compilers
  • Provide efficient mapping of program to machine
  • Don’t (usually) improve asymptotic efficiency
  • Have difficulty overcoming “optimization blockers”
Limitations of Optimizing Compilers
  • Operate under fundamental constraint
    • Must not cause any change in program behavior
    • Often prevents it from making optimizations that would only affect behavior under pathological conditions
  • Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
  • Most analysis is performed only within procedures
    • Whole-program analysis is too expensive in most cases
    • Newer versions of GCC do interprocedural analysis within individual files
  • Most analysis is based on static information
  • When in doubt, the compiler must be conservative
Generally Useful Optimizations
  • Optimizations that you or the compiler should do regardless of processor / compiler
  • Code Motion
    • Reduce frequency with which a computation is performed
      • If it will always produce the same result
      • Especially moving code out of a loop (see the sketch after this list)
  • Reduction in Strength
    • Replace a costly operation with a simpler one
    • Shift and add instead of multiply or divide
    • Recognize sequences of products
  • Share Common Subexpressions
    • Reuse portions of expressions
    • GCC will do this with -O1
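For example, the following C sketch shows code motion and strength reduction on a row-initialization routine (the function names and the task are illustrative, not taken from the notes above): the loop-invariant product n*i is hoisted out of the loop, and the remaining index arithmetic is a candidate for strength reduction.

/* Before: n*i is recomputed for every j, even though i and n do not
   change inside the loop. */
void set_row_slow(double *a, double *b, long i, long n) {
    for (long j = 0; j < n; j++)
        a[n*i + j] = b[j];
}

/* After code motion: the invariant product is computed once.  The
   remaining a[ni + j] indexing can then be strength-reduced by the
   compiler into a simple pointer increment. */
void set_row_fast(double *a, double *b, long i, long n) {
    long ni = n * i;                 /* hoisted out of the loop */
    for (long j = 0; j < n; j++)
        a[ni + j] = b[j];
}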
Optimization Blocker #1: Procedure Calls
  • Why couldn’t the compiler move strlen out of the inner loop?
    • Procedure may have side effects
      • Alters global state each time called
    • Function may not return same value for given arguments
      • Depends on other parts of global state
      • Procedure lower could interact with strlen
  • Warning
    • Compiler treats procedure call as a black box
    • Weak optimizations near them
  • Remedies
    • Use of inline functions
    • Do your own code motion (as in the lower1/lower2 sketch below)
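The strlen case mentioned above looks roughly like the sketch below (a minimal version of the lower routine; the exact lecture code may differ). In lower1 the compiler leaves strlen(s) in the loop test, making the loop quadratic; lower2 does the code motion by hand.

#include <string.h>

/* Blocker: the compiler cannot prove that strlen has no side effects
   and returns the same value each call, so it re-evaluates it on every
   iteration -- O(n^2) overall. */
void lower1(char *s) {
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* Remedy: do your own code motion and call strlen once. */
void lower2(char *s) {
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}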
Optimization Blocker #2: Memory Aliasing
  • Aliasing
    • Two different memory references specify single location
    • Easy to have happen in C
      • Since allowed to do address arithmetic
      • Direct access to storage structures
  • Get in habit of introducing local variables
    • Accumulating within loops
    • Your way of telling compiler not to check for aliasing (see the sketch after this list)
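A minimal C sketch of that habit (a row-sum routine in the spirit of the course's sum_rows example; the names are illustrative): in the first version dest might alias the row being summed, so every partial sum must go through memory, while the local accumulator in the second version can live in a register.

/* Aliased: dest[i] could overlap the row of a being summed, so the
   compiler must store and reload the accumulator on every iteration. */
void sum_rows1(double *a, double *dest, long n) {
    for (long i = 0; i < n; i++) {
        dest[i] = 0;
        for (long j = 0; j < n; j++)
            dest[i] += a[i*n + j];
    }
}

/* Local accumulator: no aliasing check needed; val stays in a register
   and is written to dest[i] once. */
void sum_rows2(double *a, double *dest, long n) {
    for (long i = 0; i < n; i++) {
        double val = 0;
        for (long j = 0; j < n; j++)
            val += a[i*n + j];
        dest[i] = val;
    }
}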
Exploiting Instruction-Level Parallelism
  • Need general understanding of modern processor design
    • Hardware can execute multiple instructions in parallel
  • Performance limited by data dependencies
  • Simple transformations can yield dramatic performance improvement
    • Compilers often cannot make these transformations
    • Lack of associativity and distributivity in floating-point arithmetic
Cycles Per Element (CPE)
  • Convenient way to express performance of a program that operates on vectors or lists
  • Length = n
  • In our case: CPE = cycles per OP
  • T = CPE*n + Overhead
    • CPE is slope of line (see the sketch after this list)
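As a concrete (hypothetical) example, the vector-sum loop below performs one OP, an addition, per element. If its measured running time fits T = CPE*n + Overhead, the slope of T versus n is this loop's CPE.

/* Baseline combining loop: one add ("OP") per element.  Timing it for
   several values of n and fitting a line gives CPE as the slope. */
double combine(const double *d, long n) {
    double acc = 0;
    for (long i = 0; i < n; i++)
        acc = acc + d[i];            /* sequential dependence on acc */
    return acc;
}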
Superscalar Processor
  • Definition
    • A superscalar processor can issue and execute multiple instructions in one cycle. The instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically.
  • Benefit
    • Without programming effort, a superscalar processor can take advantage of the instruction-level parallelism that most programs have
Pipelined Functional Units
  • Divide computation into stages
  • Pass partial computations from stage to stage
  • Stage i can start on new computation once values passed to stage i+1
Unrolling & Accumulating
  • Idea
    • Can unroll to any degree L
    • Can accumulate K results in parallel (see the 2x2 sketch after this list)
    • L must be multiple of K
  • Limitations
    • Diminishing returns
      • Cannot go beyond throughput limitations of execution units
    • Large overhead for short lengths
      • Finish off iterations sequentially
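A minimal sketch of unrolling with L = 2 and K = 2 accumulators (a hand-written variant of the combine loop shown earlier, not the exact lecture code): the two accumulators form independent dependency chains that a superscalar, pipelined machine can execute in parallel, and any leftover element is finished off sequentially.

/* 2x2 loop unrolling: two elements per iteration (L = 2), two parallel
   accumulators (K = 2). */
double combine_2x2(const double *d, long n) {
    double acc0 = 0, acc1 = 0;
    long i;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += d[i];                /* dependency chain 0 */
        acc1 += d[i+1];              /* dependency chain 1 */
    }
    for (; i < n; i++)               /* finish off remaining iterations sequentially */
        acc0 += d[i];
    return acc0 + acc1;              /* combine the partial results */
}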
Using Vector Instructions
  • Make use of AVX Instructions
    • Parallel operations on multiple data elements
    • See Web Aside OPT:SIMD on the CS:APP web page (a minimal intrinsics sketch follows)
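A minimal AVX sketch using compiler intrinsics (a hypothetical vadd routine, compiled with -mavx; a real version would also need a scalar cleanup loop when n is not a multiple of 4):

#include <immintrin.h>

/* Add two double vectors four elements at a time using 256-bit AVX
   registers.  Assumes n is a multiple of 4 to keep the sketch short. */
void vadd(const double *a, const double *b, double *c, long n) {
    for (long i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);     /* load 4 doubles */
        __m256d vb = _mm256_loadu_pd(b + i);
        _mm256_storeu_pd(c + i, _mm256_add_pd(va, vb));
    }
}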
Branch Prediction
  • Idea
    • Guess which way branch will go
    • Begin executing instructions at predicted position
      • But don’t actually modify register or memory data
Branch Misprediction Recovery
  • Performance Cost
    • Multiple clock cycles on modern processor
    • Can be a major performance limiter
Getting High Performance
  • Good compiler and flags
  • Don’t do anything stupid
    • Watch out for hidden algorithmic inefficiencies
    • Write compiler-friendly code
      • Watch out for optimization blockers
    • Look carefully at innermost loops
  • Tune code for machine
    • Exploit instruction-level parallelism
    • Avoid unpredictable branches (see the sketch after this list)
    • Make code cache friendly (Covered later in course)
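As a small illustration of the "avoid unpredictable branches" point (an illustrative counting routine, not from the lecture): on random data the branch in the first loop mispredicts often, while the second loop turns the condition into data, so the only branch left is the easily predicted loop test.

/* Branchy: whether the if is taken depends on the data. */
long count_below_branchy(const long *a, long n, long t) {
    long cnt = 0;
    for (long i = 0; i < n; i++)
        if (a[i] < t)
            cnt++;
    return cnt;
}

/* Branch-free: the comparison result (0 or 1) is simply added, which
   compilers typically translate into straight-line code. */
long count_below_branchless(const long *a, long n, long t) {
    long cnt = 0;
    for (long i = 0; i < n; i++)
        cnt += (a[i] < t);
    return cnt;
}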
Origins of the Book

This book stems from an introductory course that we developed at Carnegie Mellon University in the fall of 1998, called 15-213: Introduction to Computer Systems (ICS) [14]. The ICS course has been taught every semester since then. Over 400 students take the course each semester. The students range from sophomores to graduate students in a wide variety of majors. It is a required core course for all undergraduates in the CS and ECE departments at Carnegie Mellon, and it has become a prerequisite for most upper-level systems courses in CS and ECE.

The idea with ICS was to introduce students to computers in a different way. Few of our students would have the opportunity to build a computer system. On the other hand, most students, including all computer scientists and computer engineers, would be required to use and program computers on a daily basis. So we decided to teach about systems from the point of view of the programmer, using the following filter: we would cover a topic only if it affected the performance, correctness, or utility of user-level C programs. For example, topics such as hardware adder and bus designs were out. Topics such as machine language were in; but instead of focusing on how to write assembly language by hand, we would look at how a C compiler translates C constructs into machine code, including pointers, loops, procedure calls, and switch statements. Further, we would take a broader and more holistic view of the system as both hardware and systems software, covering such topics as linking, loading, processes, signals, performance optimization, virtual memory, I/O, and network and concurrent programming. This approach allowed us to teach the ICS course in a way that is practical, concrete, hands-on, and exciting for the students.

The response from our students and faculty colleagues was immediate and overwhelmingly positive, and we realized that others outside of CMU might benefit from using our approach. Hence this book, which we developed from the ICS lecture notes, and which we have now revised to reflect changes in technology and in how computer systems are implemented. Via the multiple editions and multiple translations of this book, ICS and many variants have become part of the computer science and computer engineering curricula at hundreds of colleges and universities worldwide.