CMU Computer Systems: Program Optimization

This article discusses the importance of code optimization. It covers generally useful optimizations such as precomputation, strength reduction, and sharing of common subexpressions, as well as optimization blockers such as procedure calls and memory aliasing. It stresses that tuning performance requires understanding how programs are compiled and executed and how modern processors work, introduces techniques for exploiting instruction-level parallelism and dealing with conditionals, and notes the limitations of optimizing compilers. It recommends optimizing at multiple levels and paying close attention to inner loops, and closes with strategies for exploiting superscalar processors and vector instructions.

Optimization

  • Overview
  • Generally Useful Optimizations
    • Code motion/precomputation
    • Strength reduction
    • Sharing of common subexpressions
    • Removing unnecessary procedure calls
  • Optimization Blockers
    • Procedure calls
    • Memory aliasing
  • Exploiting Instruction-Level Parallelism
  • Dealing with Conditionals
Performance Realities
  • There’s more to performance than asymptotic complexity
  • Constant factors matter too!
    • Easily see 10:1 performance range depending on how code is written
    • Must optimize at multiple levels:
      • algorithm, data representations, procedures, and loops
  • Must understand system to optimize performance
    • How programs are compiled and executed
    • How modern processors + memory systems operate
    • How to measure program performance and identify bottlenecks
    • How to improve performance without destroying code modularity and generality
Optimizing Compilers
  • Provide efficient mapping of program to machine
  • Don’t (usually) improve asymptotic efficiency
  • Have difficulty overcoming “optimization blockers”
Limitations of Optimizing Compilers
  • Operate under fundamental constraint
    • Must not cause any change in program behavior
    • Often prevents it from making optimizations that would only affect behavior under pathological conditions
  • Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
  • Most analysis is performed only within procedures
    • Whole-program analysis is too expensive in most cases
    • Newer versions of GCC do interprocedural analysis within individual files
  • Most analysis is based on static information
  • When in doubt, the compiler must be conservative
Generally Useful Optimizations
  • Optimizations that you or the compiler should do regardless of processor / compiler
  • Code Motion
    • Reduce frequency with which a computation is performed
      • If it will always produce the same result
      • Especially moving code out of a loop (see the sketch after this list)
  • Reduction in Strength
    • Replace a costly operation with a simpler one
    • Shift and add instead of multiply or divide
    • Recognize sequences of products
  • Share Common Subexpressions
    • Reuse portions of expressions
    • GCC will do this with -O1
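For example, the following C sketch shows code motion and strength reduction on a row-initialization routine (the function names and the task are illustrative, not taken from the notes above): the loop-invariant product n*i is hoisted out of the loop, and the remaining index arithmetic is a candidate for strength reduction.

/* Before: n*i is recomputed for every j, even though i and n do not
   change inside the loop. */
void set_row_slow(double *a, double *b, long i, long n) {
    for (long j = 0; j < n; j++)
        a[n*i + j] = b[j];
}

/* After code motion: the invariant product is computed once.  The
   remaining a[ni + j] indexing can then be strength-reduced by the
   compiler into a simple pointer increment. */
void set_row_fast(double *a, double *b, long i, long n) {
    long ni = n * i;                 /* hoisted out of the loop */
    for (long j = 0; j < n; j++)
        a[ni + j] = b[j];
}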
Optimization Blocker #1: Procedure Calls
  • Why couldn’t the compiler move strlen out of the inner loop?
    • Procedure may have side effects
      • Alters global state each time called
    • Function may not return same value for given arguments
      • Depends on other parts of global state
      • Procedure lower could interact with strlen
  • Warning
    • Compiler treats procedure call as a black box
    • Weak optimizations near them
  • Remedies
    • Use of inline functions
    • Do your own code motion (as in the lower1/lower2 sketch below)
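The strlen case mentioned above looks roughly like the sketch below (a minimal version of the lower routine; the exact lecture code may differ). In lower1 the compiler leaves strlen(s) in the loop test, making the loop quadratic; lower2 does the code motion by hand.

#include <string.h>

/* Blocker: the compiler cannot prove that strlen has no side effects
   and returns the same value each call, so it re-evaluates it on every
   iteration -- O(n^2) overall. */
void lower1(char *s) {
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* Remedy: do your own code motion and call strlen once. */
void lower2(char *s) {
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}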
Optimization Blocker #2: Memory Aliasing
  • Aliasing
    • Two different memory references specify single location
    • Easy to have happen in C
      • Since allowed to do address arithmetic
      • Direct access to storage structures
  • Get in habit of introducing local variables
    • Accumulating within loops
    • Your way of telling compiler not to check for aliasing (see the sketch after this list)
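A minimal C sketch of that habit (a row-sum routine in the spirit of the course's sum_rows example; the names are illustrative): in the first version dest might alias the row being summed, so every partial sum must go through memory, while the local accumulator in the second version can live in a register.

/* Aliased: dest[i] could overlap the row of a being summed, so the
   compiler must store and reload the accumulator on every iteration. */
void sum_rows1(double *a, double *dest, long n) {
    for (long i = 0; i < n; i++) {
        dest[i] = 0;
        for (long j = 0; j < n; j++)
            dest[i] += a[i*n + j];
    }
}

/* Local accumulator: no aliasing check needed; val stays in a register
   and is written to dest[i] once. */
void sum_rows2(double *a, double *dest, long n) {
    for (long i = 0; i < n; i++) {
        double val = 0;
        for (long j = 0; j < n; j++)
            val += a[i*n + j];
        dest[i] = val;
    }
}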
Exploiting Instruction-Level Parallelism
  • Need general understanding of modern processor design
    • Hardware can execute multiple instructions in parallel
  • Performance limited by data dependencies
  • Simple transformations can yield dramatic performance improvement
    • Compilers often cannot make these transformations
    • Lack of associativity and distributivity in floating-point arithmetic
Cycles Per Element (CPE)
  • Convenient way to express performance of a program that operates on vectors or lists
  • Length = n
  • In our case: CPE = cycles per OP
  • T = CPE*n + Overhead
    • CPE is slope of line (see the sketch after this list)
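As a concrete (hypothetical) example, the vector-sum loop below performs one OP, an addition, per element. If its measured running time fits T = CPE*n + Overhead, the slope of T versus n is this loop's CPE.

/* Baseline combining loop: one add ("OP") per element.  Timing it for
   several values of n and fitting a line gives CPE as the slope. */
double combine(const double *d, long n) {
    double acc = 0;
    for (long i = 0; i < n; i++)
        acc = acc + d[i];            /* sequential dependence on acc */
    return acc;
}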
Superscalar Processor
  • Definition
    • A superscalar processor can issue and execute multiple instructions in one cycle. The instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically.
  • Benefit
    • Without programming effort, a superscalar processor can take advantage of the instruction-level parallelism that most programs have
Pipelined Functional Units
  • Divide computation into stages
  • Pass partial computations from stage to stage
  • Stage i can start on new computation once values passed to stage i+1
Unrolling & Accumulating
  • Idea
    • Can unroll to any degree L
    • Can accumulate K results in parallel (see the 2x2 sketch after this list)
    • L must be multiple of K
  • Limitations
    • Diminishing returns
      • Cannot go beyond throughput limitations of execution units
    • Large overhead for short lengths
      • Finish off iterations sequentially
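A minimal sketch of unrolling with L = 2 and K = 2 accumulators (a hand-written variant of the combine loop shown earlier, not the exact lecture code): the two accumulators form independent dependency chains that a superscalar, pipelined machine can execute in parallel, and any leftover element is finished off sequentially.

/* 2x2 loop unrolling: two elements per iteration (L = 2), two parallel
   accumulators (K = 2). */
double combine_2x2(const double *d, long n) {
    double acc0 = 0, acc1 = 0;
    long i;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += d[i];                /* dependency chain 0 */
        acc1 += d[i+1];              /* dependency chain 1 */
    }
    for (; i < n; i++)               /* finish off remaining iterations sequentially */
        acc0 += d[i];
    return acc0 + acc1;              /* combine the partial results */
}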
Using Vector Instructions
  • Make use of AVX Instructions
    • Parallel operations on multiple data elements
    • See Web Aside OPT:SIMD on the CS:APP web page (a minimal intrinsics sketch follows)
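A minimal AVX sketch using compiler intrinsics (a hypothetical vadd routine, compiled with -mavx; a real version would also need a scalar cleanup loop when n is not a multiple of 4):

#include <immintrin.h>

/* Add two double vectors four elements at a time using 256-bit AVX
   registers.  Assumes n is a multiple of 4 to keep the sketch short. */
void vadd(const double *a, const double *b, double *c, long n) {
    for (long i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);     /* load 4 doubles */
        __m256d vb = _mm256_loadu_pd(b + i);
        _mm256_storeu_pd(c + i, _mm256_add_pd(va, vb));
    }
}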
Branch Prediction
  • Idea
    • Guess which way branch will go
    • Begin executing instructions at predicted position
      • But don’t actually modify register or memory data
Branch Misprediction Recovery
  • Performance Cost
    • Multiple clock cycles on modern processor
    • Can be a major performance limiter
Getting High Performance
  • Good compiler and flags
  • Don’t do anything stupid
    • Watch out for hidden algorithmic inefficiencies
    • Write compiler-friendly code
      • Watch out for optimization blockers
    • Look carefully at innermost loops
  • Tune code for machine
    • Exploit instruction-level parallelism
    • Avoid unpredictable branches (see the sketch after this list)
    • Make code cache friendly (Covered later in course)
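As a small illustration of the "avoid unpredictable branches" point (an illustrative counting routine, not from the lecture): on random data the branch in the first loop mispredicts often, while the second loop turns the condition into data, so the only branch left is the easily predicted loop test.

/* Branchy: whether the if is taken depends on the data. */
long count_below_branchy(const long *a, long n, long t) {
    long cnt = 0;
    for (long i = 0; i < n; i++)
        if (a[i] < t)
            cnt++;
    return cnt;
}

/* Branch-free: the comparison result (0 or 1) is simply added, which
   compilers typically translate into straight-line code. */
long count_below_branchless(const long *a, long n, long t) {
    long cnt = 0;
    for (long i = 0; i < n; i++)
        cnt += (a[i] < t);
    return cnt;
}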
Origins of the Book

This book stems from an introductory course that we developed at Carnegie Mellon University in the fall of 1998, called 15-213: Introduction to Computer Systems (ICS) [14]. The ICS course has been taught every semester since then. Over 400 students take the course each semester. The students range from sophomores to graduate students in a wide variety of majors. It is a required core course for all undergraduates in the CS and ECE departments at Carnegie Mellon, and it has become a prerequisite for most upper-level systems courses in CS and ECE.

The idea with ICS was to introduce students to computers in a different way. Few of our students would have the opportunity to build a computer system. On the other hand, most students, including all computer scientists and computer engineers, would be required to use and program computers on a daily basis. So we decided to teach about systems from the point of view of the programmer, using the following filter: we would cover a topic only if it affected the performance, correctness, or utility of user-level C programs. For example, topics such as hardware adder and bus designs were out. Topics such as machine language were in; but instead of focusing on how to write assembly language by hand, we would look at how a C compiler translates C constructs into machine code, including pointers, loops, procedure calls, and switch statements. Further, we would take a broader and more holistic view of the system as both hardware and systems software, covering such topics as linking, loading, processes, signals, performance optimization, virtual memory, I/O, and network and concurrent programming. This approach allowed us to teach the ICS course in a way that is practical, concrete, hands-on, and exciting for the students.

The response from our students and faculty colleagues was immediate and overwhelmingly positive, and we realized that others outside of CMU might benefit from using our approach. Hence this book, which we developed from the ICS lecture notes, and which we have now revised to reflect changes in technology and in how computer systems are implemented. Via the multiple editions and multiple translations of this book, ICS and many variants have become part of the computer science and computer engineering curricula at hundreds of colleges and universities worldwide.