Managing power and performance trade-offs
in distributed cloud-native infrastructures
NECST (Thursday) Friday talk, 04/23/2020
Rolando Brondolin
<rolando.brondolin@polimi.it>
Cloud computing
A cloudy landscape
Cloud services have become more structured and varied over the last few years
Physical Hardware
VM1 VM2
C1 C2 C3 C4
The complexity of the
environment is left
to the Cloud provider
Data centers will account for 8% of worldwide energy consumption by 2030 [1]
[1] Anders SG Andrae and Tomas Edler. On global electricity usage of communication technology: trends to 2030. Challenges, 6(1):117–157, 2015.
The need for power awareness
[2] Beloglazov, A., Buyya, R., Lee, Y. C., Zomaya, A. Y., et al. A taxonomy and survey of energy-efficient data centers and cloud computing systems. Advances in Computers 82 (2011), 47–111.
[1] Cui, Yan, et al. "Total cost of ownership model for data center technology evaluation." Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm), 2017 16th
IEEE Intersociety Conference on. IEEE, 2017.
Power consumption accounts for about 20% of a data center's TCO [1]
Lifetime energy cost will exceed hardware cost in the near future [2]
Energy budgets and power caps constrain the performance of the system
Power consumption is affected by a plethora of different actors
Performance is key for production systems, regardless of power consumption
Tradeoff between power consumption and performance
Applications provide a service to the user:
• performance should be guaranteed
• power saving comes from run-time management
Cloud-native technologies
According to the CNCF[1]:
"Cloud native technologies empower organizations to build and run
scalable applications in modern, dynamic environments such as public,
private, and hybrid clouds.
Containers, service meshes, microservices, immutable infrastructure,
and declarative APIs exemplify this approach."
[1] https://github.com/cncf/toc/blob/master/DEFINITION.md
Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, et al. An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 3–18, 2019.
We said DeathStars…
Power vs performance: challenges
• how we can measure the behavior of cloud-native applications and the environment in terms of performance and power consumption
• how we can define performance throughout cloud-native applications and how we can define meaningful performance targets
• how we can effectively and precisely reduce power consumption while preserving the performance of the running workloads
Run-time power management
Instrumentation-free observation
•Performance monitor
•Power monitor and attribution
Decide and optimize
•Reactive/predictive Power-aware control
•Control policies based on overall goals
Fast actuation
•Enforce policies to save power
•Tune power and resource allocation
Proposed general approach
Observe - Decide - Act loop
Goal: introduce autonomicity to improve energy efficiency and bring energy proportionality to cloud-native applications
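The loop above can be sketched in a few lines of Python. Everything here (function names, the sample format, the proportional policy) is illustrative, not the actual DEEP-mon/HyPPO interface:

```python
# Minimal Observe-Decide-Act loop sketch (illustrative only).

def observe(samples):
    """Observe: average the per-container power samples (watts)."""
    return {c: sum(v) / len(v) for c, v in samples.items()}

def decide(metrics, power_budget_w):
    """Decide: if the node exceeds its budget, scale every container's
    power cap proportionally; otherwise leave containers uncapped."""
    total = sum(metrics.values())
    if total <= power_budget_w:
        return {c: None for c in metrics}
    scale = power_budget_w / total
    return {c: p * scale for c, p in metrics.items()}

def act(caps):
    """Act: a real actuator would program RAPL limits / CPU quotas here."""
    return {c: cap for c, cap in caps.items() if cap is not None}

samples = {"web": [30.0, 34.0], "db": [22.0, 18.0]}  # watts per container
caps = act(decide(observe(samples), power_budget_w=40.0))
```

Running the loop once keeps the node at its 40 W budget by shrinking each container's cap in proportion to its measured draw.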
Run-time power management
RAPL
CPU quota
Thread affinity
DEEP-mon
• per-container power monitor
• per-container performance measurements and PMCs
• per-request latency measurements
HyPPO
• reactive control
• CPU usage & request
• power capping
Act
DEEP-mon at a glance
• DEEP-mon is an HT-aware fine-grained power monitor for container-based environments
- precise power attribution to containers
- instrumentation-free: watches workloads from the outside
- lightweight, with little overhead on the target workloads and systems
- scalable and distributed, to observe Kubernetes clusters
• Monitoring ingredients:
Container
execution
Resource
usage
Power
consumption
Context
switch
Performance
Counter (cycle)*
Intel
RAPL
* cycles show a 99% correlation with CPU power usage
Anatomy of a power monitoring agent
user-space
kernel-space
Intel RAPL
DEEP-mon
Power attribution
Docker and Kubernetes metrics
kernel tracing
PMC
context switch
Linux CFS
Monitoring back-end
kernel tracing handles ≈ 200K context-switch events/s
Kernel-level data acquisition (1)
• We cannot send each context switch to user-space
- too many events per second to process
- too much overhead
• Introduce in-kernel data aggregation:
eBPF and BCC: build, inject and execute code in a kernel VM
trace context switches, count PMCs on the fly
store data in eBPF data structures
send one big event instead of many small ones
DEEP-mon
kernel
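The batching idea behind the in-kernel aggregation can be modeled in plain Python (this is an illustration of the scheme, not actual eBPF/BCC code; in DEEP-mon the map lives inside the kernel VM):

```python
from collections import defaultdict

class SwitchAggregator:
    """Models the in-kernel eBPF map: fold per-thread cycle counts at each
    context switch, then flush one aggregate event per interval instead of
    one event per switch (illustrative only, not real eBPF code)."""

    def __init__(self):
        self.cycles = defaultdict(int)  # tid -> cycles since last flush

    def on_context_switch(self, tid, cycles):
        # In DEEP-mon this update happens inside the kernel VM.
        self.cycles[tid] += cycles

    def flush(self):
        # One "big event" sent to user-space, emptying the map.
        event, self.cycles = dict(self.cycles), defaultdict(int)
        return event
```

Thousands of per-switch samples collapse into a single dictionary per monitoring interval, which is what keeps the user-space side of the monitor cheap.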
Correlate power and performance
• At fixed time intervals we collect the thread map
• Then we extract the power measurement from RAPL and attribute it to each thread:
eBPF output
Thread1
Thread2
Thread3
thread map
[Paper excerpt: HT experiments with NPB benchmarks on a Dell PowerEdge (Intel Xeon E5-2680 v2, 10 cores, Ubuntu Linux) pin two threads to the same physical core; the measured power consumption ratio is ≈ 1.15 w.r.t. a single thread.]

…the execution periods in which the thread was co-running on the same physical core via HT, weighted by the HTr ratio and divided by 2 to equally divide the overlapping cycles among the two threads. In this context an execution period is defined as the time between context switches on the physical core where the thread is scheduled.

Starting from Equation (1), we can now attribute the power measured by RAPL for our thread T1 following Equation (2), where |K| is the cardinality of the set K of threads running in the server in a given period of time and |S| is the cardinality of the set S of sockets in the system.

P_T1(t) = Σ_{s=0}^{|S|} [ RAPL_core(t, s) · Cycles_TW1(t, s) / Σ_{k=0}^{|K|} Cycles_TWk(t, s) ]   (2)
Power of Thread 1
Sum among all sockets
RAPL measurement of the socket
Thread weight inside
the socket power consumption
• Finally we group each thread by container
DEEP-mon
kernel
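Equation (2) can be transcribed almost literally; the data layout below (per-socket RAPL readings, per-thread weighted cycle counts keyed by `(thread, socket)`) is a hypothetical convenience, not DEEP-mon's actual structures:

```python
def attribute_power(rapl_core, cycles, thread):
    """Equation (2): split each socket's RAPL core power among threads in
    proportion to the (HT-weighted) cycles each thread retired there.
    rapl_core: {socket: watts}; cycles: {(thread, socket): weighted cycles}."""
    power = 0.0
    for s, socket_power in rapl_core.items():
        # Denominator: total weighted cycles retired on socket s.
        total = sum(c for (k, sk), c in cycles.items() if sk == s)
        if total > 0:
            power += socket_power * cycles.get((thread, s), 0) / total
    return power
```

By construction the attributions of all threads on a socket sum to the socket's RAPL reading, so no power is lost or double counted.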
Monitoring containers at scale
• Once power data is collected, we can send it to a back-end on a regular basis
- further aggregation of metrics data
- Kubernetes cluster-level view
• The backend exposes data for visualization and autonomic power management
Benchmarks
Cloud Benchmarks: Phoronix test suite pts/apache, pts/Nginx, pts/fio
HPC Benchmarks: NAS Parallel benchmarks EP, MG, CG
Experimental results
Network and syscall intensive benchmarks CPU and memory intensive tasks
Cloud benchmarks
app overhead
< 3.3%
HPC benchmarks
app overhead
< 4%
Cloud benchmarks
power overhead
1.74% avg
HPC benchmarks
power overhead
0.90% avg
Evaluation goals
Monitoring should introduce minimal overhead
We evaluated DEEP-mon w.r.t. its overhead on applications and the target system
Much data!
HyPPO
• HyPPO is a Hybrid Performance-aware Power-capping Orchestrator for Kubernetes environments
- leverages run-time monitoring data coming from DEEP-mon
- guarantees SLAs for each container
- autonomic management of SLAs and power consumption
- hybrid: HW power capping, SW resource management
Energy proportionality
The resources I use == the energy bill I pay
Performance first
Guaranteed user experience, saving power
HyPPO at a glance
• What if we try to slow down some components?
- reducing performance means reducing power usage (most of the time)
- do it only without affecting the end-user experience (and SLAs)
- give performance back when needed
• Operative steps
- measure CPU usage and power consumption in real time
- reason about new possible allocations
- act based on previous decisions
• Technical challenges
- how to operate on a distributed system
- workload instrumentation and power attribution
- autonomicity in the decision process is key; keep an eye on goals and requirements
- how to push performance and power usage up and down
Distributed ODA loop
Master
Node Node
API
API
Pod Pod
API
Pod Pod
DEEP-mon agent DEEP-mon agent
Monitor
Backend
ACTUATOR
AGENT
ACTUATOR
AGENT
HyPPO
HyPPO controller
[Figure: the HyPPO controller receives samples of metrics and Kubernetes status from each monitoring agent in the gRPC collector; samples are unpacked by the Metrics workers, aggregated, and stored in an InfluxDB database, which is queried by the monitoring frontend (real-time view) and by the control-loop components.]

power_n = P_idle + Σ_{c=0}^{C} (power_{n,c} + i(c))   (1)

Equation (1) defines the power for the n-th node as the sum of the idle power P_idle and the powers consumed by each container running on the node, where power_{n,c} is the power consumed by the c-th container on the n-th node, plus a contribution i(c). The contribution can be positive or negative and is expressed in Equation (2), where cpu_request_c represents the CPU request expressed for the c-th container, cpu_usage_c represents its actual CPU consumption, and P is a proportional factor that can be defined in the controller configuration. Each container CPU usage data point passes through the controller represented by Equations (1) and (2); in this way, it contributes positively or negatively to the total power consumption of the node on which the container is actually running.

i(c) = (cpu_usage_c - cpu_request_c) · P   if cpu_usage_c exists
i(c) = 0                                   otherwise        (2)

The P parameter was chosen after several experiments and represents the pace at which the controller tries to fill the opportunity gap. It is defined in the configuration file of the controller as 10000 mW (10 W).
For each host we compute the power cap to be enforced depending on running containers
CONTROLLER
power of node n
power of idle system power of the container
power adjusting
under/over utilisation condition proportional factor (10W)
Power is then adjusted depending on how the CPU is used
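Equations (1) and (2) translate into a few lines of Python. The container records and field names below are hypothetical; powers are in watts, CPU in cores, and P = 10 W as in the controller configuration:

```python
def contribution(cpu_usage, cpu_request, p_gain=10.0):
    """Equation (2): i(c) pushes power up when the container uses more CPU
    than requested and down when it uses less; 0 if no usage sample exists."""
    if cpu_usage is None:
        return 0.0
    return (cpu_usage - cpu_request) * p_gain

def node_power_cap(p_idle, containers, p_gain=10.0):
    """Equation (1): node power = idle power + per-container measured power
    adjusted by i(c). The result becomes the cap enforced on the node."""
    return p_idle + sum(
        c["power"] + contribution(c.get("usage"), c["request"], p_gain)
        for c in containers
    )

# A container using 4 of its 5 requested cores lowers the cap by 10 W:
containers = [
    {"power": 30.0, "usage": 4.0, "request": 5.0},
    {"power": 20.0, "request": 5.0},  # no usage sample yet -> i(c) = 0
]
cap = node_power_cap(p_idle=50.0, containers=containers)
```

A container under its request shrinks the node cap (reclaiming the opportunity gap), while one over its request grows it, giving performance back.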
Actuation
ACTUATOR
AGENT
RAPL MSR
Power unit Power limit
Power budget node I
• For each host we have a power budget
• We leverage Intel RAPL to enforce the
power capping on the system
• Then we translate the power budget
into power units
• Finally we write the power cap into the power-limit field of the RAPL MSR
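The budget-to-MSR translation can be sketched as below. The bit layout follows Intel's documented RAPL interface (power-unit field in bits 3:0 of MSR_RAPL_POWER_UNIT; 15-bit limit, enable bit 15 and clamp bit 16 in MSR_PKG_POWER_LIMIT); the privileged MSR write itself and the time-window bits are left out of this sketch:

```python
def watts_to_rapl_units(watts, power_unit_field):
    """One RAPL power unit is (1/2)^power_unit_field watts, as read from
    bits 3:0 of MSR_RAPL_POWER_UNIT (field 3 -> 0.125 W per unit)."""
    return round(watts * (1 << power_unit_field))

def pkg_power_limit(watts, power_unit_field):
    """Compose the low bits of MSR_PKG_POWER_LIMIT: 15-bit power limit in
    RAPL units, plus the enable (bit 15) and clamp (bit 16) flags."""
    limit = watts_to_rapl_units(watts, power_unit_field) & 0x7FFF
    return limit | (1 << 15) | (1 << 16)
```

With the common unit field of 3, a 40 W budget becomes 320 raw units before the enable/clamp flags are OR-ed in.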
Experimental setup
Testbed
Kubernetes cluster composed of 2 homogeneous nodes
Node specs: Dell PowerEdge r720xd, 2x Intel Xeon E5-2680 Ivy Bridge (20 HT), 2.80GHz, 380GB of RAM
Benchmarks: Phoronix Test Suite, CPU request = 5 cores (5000 millicpus)
[Chart: Apache CPU Opportunity Gap - CPU% over execution time, apache-cpu vs CPU Request]
Goal
HyPPO should be able to guarantee performance
and at the same time try to reduce power usage
Experimental results (1): Apache
HyPPO provides good performance, with a 5% SLA violation rate
[Chart: Apache CPU usage - apache-cpu vs apache-cpu-ctrl vs CPU Request]
[Chart: Apache power consumed - apache-pw vs apache-pw-ctrl]
Experimental results (2): Fio
[Chart: FIO CPU usage - fio-cpu vs fio-cpu-ctrl vs CPU Request]
Power saving, but higher benchmark execution time
[Chart: FIO power consumption - fio-pw vs fio-pw-ctrl]
Experimental results (3): Nginx
[Chart: NGINX CPU usage - nginx-cpu vs nginx-cpu-ctrl vs CPU Request]
Good power-saving results, with few effects on NGINX CPU usage (few threads)
[Chart: NGINX power consumption - nginx-pw vs nginx-pw-ctrl]
Preliminary tests showed power savings ranging from 5% to 45% and SLA violations of 2.5% on average
HyPPO currently works well with multithreaded workloads
Sometimes we cannot slow down applications
Heterogeneity helps us to meet SLAs and improve energy efficiency
Batches of DeathStars
Problem definition
FPGAVM
App
Runtime
req/s
time
Our approach: BlastFunction
FPGA
App
FPGA
App
FPGA
App
BlastFunction in a nutshell
FPGA Sharing system for Microservices and Serverless functions
Reconfiguration-aware and Accelerator-independent
Transparent integration and scalability
Fully integrated with existing cloud orchestrator (Kubernetes)
Containers / Functions
Orchestrator
Accelerator
Registry
Node
FPGA
Device
Manager
FPGA
Device
Manager
High level architecture
Node
FPGA
Device
Manager
App
Remote OpenCL library
Connects to the Registry/Device Manager to access the remote device
Synchronous + asynchronous actions
Supports multiple kernels, accelerators and concurrent queues
Transparent OpenCL wrapper
Easy integration (dynamic library w/ OpenCL loader)
App
Device Manager
Controls a single underlying FPGA device
Exposes gRPC and shared memory interfaces
to perform actions through the device
Reconfiguration-aware
Group tasks (read/exec/write) based on source
function
Device
Manager
App
Ex. Thread
DeviceManager
Single Task
Buffer Write
Kernel Run
Buffer Read
endpoint
AppApp
FPGA
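The grouping policy above can be sketched as a simplified scheduler (the real Device Manager also handles fairness and reconfiguration logic; this only models the batching rule):

```python
def group_by_function(tasks):
    """Batch consecutive tasks coming from the same source function so the
    write/run/read triple executes back to back and the shared FPGA is not
    reconfigured mid-request (simplified model of the Device Manager)."""
    batches = []
    for fn, op in tasks:  # op is one of "write", "run", "read"
        if batches and batches[-1][0] == fn:
            batches[-1][1].append(op)   # same function: extend its batch
        else:
            batches.append((fn, [op]))  # new function: open a new batch
    return batches
```

Interleaved requests from different functions thus become per-function batches, and a bitstream swap is only needed at batch boundaries.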
Accelerators Registry
Master component of the system
Contains all the information about functions and devices
Performs allocation before and after function deployment
Integrated with the orchestrator through hooks and APIs
Allocation based on runtime metrics of the system
Accelerator
Registry
Experimental evaluation
• Master node: Intel Xeon-W3530, 24GB RAM
• 2 Worker nodes: Intel i7-6700, 32GB RAM
• Terasic DE5a-Net FPGA on all nodes
• Overhead, perf. behaviour on single / multi-app
Sobel Filter Matrix Multiplication
Test Applications/Kernels
CNNs
Setup
Overhead results - I/O latencies
• Overhead given by memory copy operations
• 3 data copies for BlastFunction, 1 data copy for BlastFunction shm
• ~47% latency slowdown w.r.t. native execution (for I/O-only operations)
• One data copy to guarantee OpenCL transparency
Overhead results
Sobel Matrix Multiplication
• BlastFunction overhead ranges from 2.46 ms up to 24 ms w.r.t. native
• BlastFunction shm keeps a ~2 ms (24.04%) overhead w.r.t. native across all executions, from images of 10x10 pixels to 1920x1080 pixels
• Shared memory provides limited and acceptable overhead
• Minimum RTT of 2ms for both BlastFunction and BlastFunction shm
• BlastFunction shm has a maximum overhead of 17ms (0.27%) w.r.t.
native (over 3.571s of Native execution time at the largest input size)
• Shared memory provides negligible overhead
Multi-application results (Sobel accelerator)
Throughput
1.40x
% Utilization
1.46x
Applications
5 vs 3
BLF, Low load
BLF, Medium load
BLF, High load
Native, Low load
Native, Medium load
Native, High load
Latency (max)
1.04x
Multi-application results (matmult accelerator)
Throughput
2.15x
% Utilization
1.17x
Applications
5 vs 3
BLF, Low load
BLF, Medium load
BLF, High load
Native, Low load
Native, Medium load
Native, High load
Latency (max)
0.59x
Multi-application results (CNN accelerator)
Throughput
1.26x
% Utilization
1.06x
Applications
5 vs 3
BLF, Low load
BLF, High load
Native, Low load
Native, High load
Latency (max)
1.40x
Conclusion
• Cloud-native application monitoring: performance and power observations
- Black-box power monitoring approach
- Performance counters, power consumption, CPU % measured w/ negligible overhead
• Cloud-native performance-aware power-capping orchestration
- Distributed power capping for cloud-native applications
- 25% average power savings, 5% SLA violation rate; works well w/ multithreaded workloads
• Looking forward: accelerated cloud-native workloads
- Improve performance of cloud-native microservices w/ FPGA-based systems
- Sharing improves utilization of the FPGAs with acceptable latency degradation
Future work
[Figure: latency of microservice benchmarks, from the DeathStarBench suite (Gan et al., ASPLOS 2019)]
Acknowledgements
I would like to thank these amazing people for all their work and support towards the realization of this work (and many others)
Fabiola Casasopra
Luca Danelutti
Luca Malagutti
Giorgia Fiscaletti
Sara Notargiacomo
Marco Santambrogio
Tommaso Sardelli
Marco Arnaboldi
Marco Bacis
Andrea Strada
Daniele Rossetti
Samuele Barbieri

More Related Content

PDF
Run-time power management in cloud and containerized environments
PDF
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
PDF
CoolDC'16: Seeing into a Public Cloud: Monitoring the Massachusetts Open Cloud
PDF
COMPARISON OF ENERGY OPTIMIZATION CLUSTERING ALGORITHMS IN WIRELESS SENSOR NE...
PDF
IRJET- An Energy-Saving Task Scheduling Strategy based on Vacation Queuing & ...
PDF
Intelligent Workload Management in Virtualized Cloud Environment
PDF
Intelligent Placement of Datacenter for Internet Services
PPTX
Augmenting Amdahl's Second Law for Cost-Effective and Balanced HPC Infrastruc...
Run-time power management in cloud and containerized environments
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
CoolDC'16: Seeing into a Public Cloud: Monitoring the Massachusetts Open Cloud
COMPARISON OF ENERGY OPTIMIZATION CLUSTERING ALGORITHMS IN WIRELESS SENSOR NE...
IRJET- An Energy-Saving Task Scheduling Strategy based on Vacation Queuing & ...
Intelligent Workload Management in Virtualized Cloud Environment
Intelligent Placement of Datacenter for Internet Services
Augmenting Amdahl's Second Law for Cost-Effective and Balanced HPC Infrastruc...

What's hot (20)

PDF
IRJET- Optimization with PSO and FPO based Control for Energy Efficient of Se...
PDF
Detecting Lateral Movement with a Compute-Intense Graph Kernel
PDF
9.distributive energy efficient adaptive clustering protocol for wireless sen...
PDF
A Scalable and Distributed Electrical Power Monitoring System Utilizing Cloud...
PDF
Optimization of energy consumption in cloud computing datacenters
PDF
IRJET- An Efficient Dynamic Deputy Cluster Head Selection Method for Wireless...
PDF
Applying Cloud Techniques to Address Complexity in HPC System Integrations
PDF
Survey: An Optimized Energy Consumption of Resources in Cloud Data Centers
PDF
IRJET- Sink Mobility based Energy Efficient Routing Protocol for Wireless Sen...
PDF
Quality of Service based Task Scheduling Algorithms in Cloud Computing
PDF
Load Balancing in Cloud Computing Through Virtual Machine Placement
PDF
Energy-Efficient Hybrid K-Means Algorithm for Clustered Wireless Sensor Netw...
PPTX
PDF
Hybrid Task Scheduling Approach using Gravitational and ACO Search Algorithm
PDF
REGION BASED DATA CENTRE RESOURCE ANALYSIS FOR BUSINESSES
PDF
PDF
Particle Swarm Optimization (PSO)-Based Distributed Power Control Algorithm f...
PDF
G03202048050
PDF
Energy Efficient Change Management in a Cloud Computing Environment
PDF
A Novel Cluster-Based Energy Efficient Routing With Hybrid Protocol in Wirele...
IRJET- Optimization with PSO and FPO based Control for Energy Efficient of Se...
Detecting Lateral Movement with a Compute-Intense Graph Kernel
9.distributive energy efficient adaptive clustering protocol for wireless sen...
A Scalable and Distributed Electrical Power Monitoring System Utilizing Cloud...
Optimization of energy consumption in cloud computing datacenters
IRJET- An Efficient Dynamic Deputy Cluster Head Selection Method for Wireless...
Applying Cloud Techniques to Address Complexity in HPC System Integrations
Survey: An Optimized Energy Consumption of Resources in Cloud Data Centers
IRJET- Sink Mobility based Energy Efficient Routing Protocol for Wireless Sen...
Quality of Service based Task Scheduling Algorithms in Cloud Computing
Load Balancing in Cloud Computing Through Virtual Machine Placement
Energy-Efficient Hybrid K-Means Algorithm for Clustered Wireless Sensor Netw...
Hybrid Task Scheduling Approach using Gravitational and ACO Search Algorithm
REGION BASED DATA CENTRE RESOURCE ANALYSIS FOR BUSINESSES
Particle Swarm Optimization (PSO)-Based Distributed Power Control Algorithm f...
G03202048050
Energy Efficient Change Management in a Cloud Computing Environment
A Novel Cluster-Based Energy Efficient Routing With Hybrid Protocol in Wirele...
Ad

Similar to HYPPO - NECSTTechTalk 23/04/2020 (20)

PDF
DEEP-mon: Dynamic and Energy Efficient Power monitoring for container-based i...
PDF
Servant-ModLeach Energy Efficient Cluster Base Routing Protocol for Large Sca...
PDF
IRJET- An Enhanced Cluster (CH-LEACH) based Routing Scheme for Wireless Senso...
PDF
[EWiLi2016] Towards a performance-aware power capping orchestrator for the Xe...
PDF
DEEP-mon: Dynamic and Energy Efficient Power monitoring for container-based i...
PDF
A Brief Survey of Current Power Limiting Strategies
PDF
Keep Calm and React with Foresight: Strategies for Low-Latency and Energy-Eff...
PPTX
Power Comparison Power Comparison of Cloud Data of Cloud Data Center Architec...
PDF
Residual Energy Based Cluster head Selection in WSNs for IoT Application
PDF
Energy aware load balancing and application scaling for the cloud ecosystem
PDF
[EUC2016] DockerCap: a software-level power capping orchestrator for Docker c...
PDF
System on Chip Based RTC in Power Electronics
PDF
Performance and Energy evaluation
PDF
Optical Switching in the Datacenter
PDF
Managing Grid Constraints with Active Management Systems
PDF
Energy-aware Load Balancing and Application Scaling for the Cloud Ecosystem
PDF
A NEW DATA ENCODER AND DECODER SCHEME FOR NETWORK ON CHIP
PDF
IRJET- Load Frequency Control of a Renewable Source Integrated Four Area ...
PDF
IRJET- DOE to Minimize the Energy Consumption of RPL Routing Protocol in IoT ...
PDF
ENERGY CONSUMPTION IMPROVEMENT OF TRADITIONAL CLUSTERING METHOD IN WIRELESS S...
DEEP-mon: Dynamic and Energy Efficient Power monitoring for container-based i...
Servant-ModLeach Energy Efficient Cluster Base Routing Protocol for Large Sca...
IRJET- An Enhanced Cluster (CH-LEACH) based Routing Scheme for Wireless Senso...
[EWiLi2016] Towards a performance-aware power capping orchestrator for the Xe...
DEEP-mon: Dynamic and Energy Efficient Power monitoring for container-based i...
A Brief Survey of Current Power Limiting Strategies
Keep Calm and React with Foresight: Strategies for Low-Latency and Energy-Eff...
Power Comparison Power Comparison of Cloud Data of Cloud Data Center Architec...
Residual Energy Based Cluster head Selection in WSNs for IoT Application
Energy aware load balancing and application scaling for the cloud ecosystem
[EUC2016] DockerCap: a software-level power capping orchestrator for Docker c...
System on Chip Based RTC in Power Electronics
Performance and Energy evaluation
Optical Switching in the Datacenter
Managing Grid Constraints with Active Management Systems
Energy-aware Load Balancing and Application Scaling for the Cloud Ecosystem
A NEW DATA ENCODER AND DECODER SCHEME FOR NETWORK ON CHIP
IRJET- Load Frequency Control of a Renewable Source Integrated Four Area ...
IRJET- DOE to Minimize the Energy Consumption of RPL Routing Protocol in IoT ...
ENERGY CONSUMPTION IMPROVEMENT OF TRADITIONAL CLUSTERING METHOD IN WIRELESS S...
Ad

More from NECST Lab @ Politecnico di Milano (20)

PDF
Mesticheria Team - WiiReflex
PPTX
Punto e virgola Team - Stressometro
PDF
BitIt Team - Stay.straight
PDF
BabYodini Team - Talking Gloves
PDF
printf("Nome Squadra"); Team - NeoTon
PPTX
BlackBoard Team - Motion Tracking Platform
PDF
#include<brain.h> Team - HomeBeatHome
PDF
Flipflops Team - Wave U
PDF
Bug(atta) Team - Little Brother
PDF
#NECSTCamp: come partecipare
PDF
NECSTLab101 2020.2021
PDF
TreeHouse, nourish your community
PDF
TiReX: Tiled Regular eXpressionsmatching architecture
PDF
Embedding based knowledge graph link prediction for drug repurposing
PDF
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
PDF
EMPhASIS - An EMbedded Public Attention Stress Identification System
PDF
Luns - Automatic lungs segmentation through neural network
PDF
BlastFunction: How to combine Serverless and FPGAs
PDF
Maeve - Fast genome analysis leveraging exact string matching
Mesticheria Team - WiiReflex
Punto e virgola Team - Stressometro
BitIt Team - Stay.straight
BabYodini Team - Talking Gloves
printf("Nome Squadra"); Team - NeoTon
BlackBoard Team - Motion Tracking Platform
#include<brain.h> Team - HomeBeatHome
Flipflops Team - Wave U
Bug(atta) Team - Little Brother
#NECSTCamp: come partecipare
NECSTLab101 2020.2021
TreeHouse, nourish your community
TiReX: Tiled Regular eXpressionsmatching architecture
Embedding based knowledge graph link prediction for drug repurposing
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
EMPhASIS - An EMbedded Public Attention Stress Identification System
Luns - Automatic lungs segmentation through neural network
BlastFunction: How to combine Serverless and FPGAs
Maeve - Fast genome analysis leveraging exact string matching

Recently uploaded (20)

PPTX
Module1.pptxrjkeieuekwkwoowkemehehehrjrjrj
PDF
Lesson 3 .pdf
PDF
VSL-Strand-Post-tensioning-Systems-Technical-Catalogue_2019-01.pdf
PDF
Beginners-Guide-to-Artificial-Intelligence.pdf
PDF
VTU IOT LAB MANUAL (BCS701) Computer science and Engineering
PPTX
Quality engineering part 1 for engineering undergraduates
PPT
Programmable Logic Controller PLC and Industrial Automation
PPTX
Environmental studies, Moudle 3-Environmental Pollution.pptx
PDF
Mechanics of materials week 2 rajeshwari
PDF
IAE-V2500 Engine Airbus Family A319/320
PPTX
Wireless sensor networks (WSN) SRM unit 2
PPTX
Solar energy pdf of gitam songa hemant k
PDF
Project_Mgmt_Institute_-Marc Marc Marc .pdf
PDF
Principles of operation, construction, theory, advantages and disadvantages, ...
DOCX
An investigation of the use of recycled crumb rubber as a partial replacement...
PPTX
DATA STRCUTURE LABORATORY -BCSL305(PRG1)
PDF
MLpara ingenieira CIVIL, meca Y AMBIENTAL
PPTX
BBOC407 BIOLOGY FOR ENGINEERS (CS) - MODULE 1 PART 1.pptx
PPTX
MAD Unit - 3 User Interface and Data Management (Diploma IT)
PDF
IAE-V2500 Engine for Airbus Family 319/320
Module1.pptxrjkeieuekwkwoowkemehehehrjrjrj
Lesson 3 .pdf
VSL-Strand-Post-tensioning-Systems-Technical-Catalogue_2019-01.pdf
Beginners-Guide-to-Artificial-Intelligence.pdf
VTU IOT LAB MANUAL (BCS701) Computer science and Engineering
Quality engineering part 1 for engineering undergraduates
Programmable Logic Controller PLC and Industrial Automation
Environmental studies, Moudle 3-Environmental Pollution.pptx
Mechanics of materials week 2 rajeshwari
IAE-V2500 Engine Airbus Family A319/320
Wireless sensor networks (WSN) SRM unit 2
Solar energy pdf of gitam songa hemant k
Project_Mgmt_Institute_-Marc Marc Marc .pdf
Principles of operation, construction, theory, advantages and disadvantages, ...
An investigation of the use of recycled crumb rubber as a partial replacement...
DATA STRCUTURE LABORATORY -BCSL305(PRG1)
MLpara ingenieira CIVIL, meca Y AMBIENTAL
BBOC407 BIOLOGY FOR ENGINEERS (CS) - MODULE 1 PART 1.pptx
MAD Unit - 3 User Interface and Data Management (Diploma IT)
IAE-V2500 Engine for Airbus Family 319/320

HYPPO - NECSTTechTalk 23/04/2020

  • 1. Managing power and performance trade-offs in distributed cloud-na7ve infrastructures NECST (Thursday) Friday talk, 04/23/2020 Rolando Brondolin <[email protected]>
  • 2. 2
  • 3. 3
  • 5. 5
  • 6. 6
  • 7. A cloudy landscape Cloud services became more structured and variegated in the last few years 7 Physical Hardware VM1 VM2 C1 C2 C3 C4 The complexity of the environment is left to the Cloud provider
  • 8. 8 Data-centers will consume 8% of the energy consump7on of the world by 2030 [1] [1] Anders SG Andrae and Tomas Edler. On global electricity usage of communication technology: trends to 2030. Challenges, 6(1):117–157, 2015.
  • 9. The need for power awareness 9 [2] Beloglazov, A., Buyya, R., Lee, Y. C., Zomaya, A., et Al, taxonomy and survey of energy-efficient datacenters and cloud compu7ng systems. Advances in computers 82, 2 (2011) 47–111. [1] Cui, Yan, et al. "Total cost of ownership model for data center technology evaluation." Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm), 2017 16th IEEE Intersociety Conference on. IEEE, 2017. Power consumption is accounted for 20% of a data-center TCO [1] Lifetime energy cost will exceed hardware cost in the near future [2] Energy budgets and power caps constraint the performance of the system Power consumption is affected by a plethora of different actors Performance are key for production systems despite power consumption
  • 10. The need for power awareness 10 (Builds on the previous slide.) There is a trade-off between power consumption and performance. Applications provide a service to the user: • performance should be guaranteed • power saving comes from run-time management
  • 11. Cloud-native technologies 11 According to the CNCF[1]: "Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach." [1] https://github.com/cncf/toc/blob/master/DEFINITION.md
  • 12. 12 Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, et al. An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 3–18, 2019. We said DeathStars…
  • 13. Power vs performance: challenges 13 • how we can measure the behavior of cloud-native applications and the environment in terms of performance and power consumption • how we can define performance throughout cloud-native applications and how we can define meaningful performance targets • how we can effectively and precisely reduce power consumption while preserving the performance of the running workloads
  • 14. Run-time power management 14 Proposed general approach: an Observe - Decide - Act (O-D-A) loop. Observe: instrumentation-free observation • performance monitor • power monitor and attribution. Decide and optimize: • reactive/predictive power-aware control • control policies based on overall goals. Act: fast actuation • enforce policies to save power • tune power and resource allocation. Goal: introduce autonomicity to improve energy efficiency and energy proportionality of cloud-native applications
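The Observe - Decide - Act loop above can be sketched as a minimal control skeleton. This is a hypothetical illustration, not the actual HyPPO code: the function names and the 80%-of-last-draw policy in the usage example are assumptions for the sake of the demo.

```python
import time

def run_oda_loop(observe, decide, act, interval_s=1.0, iterations=None):
    """Generic Observe-Decide-Act loop: sample metrics, pick an action,
    enforce it, then wait for the next control period."""
    i = 0
    while iterations is None or i < iterations:
        metrics = observe()          # e.g. per-container power and CPU usage
        decision = decide(metrics)   # e.g. a per-node power cap in watts
        act(decision)                # e.g. program RAPL / a cgroup CPU quota
        time.sleep(interval_s)
        i += 1

# Toy usage: cap node power at 80% of the last observed draw.
if __name__ == "__main__":
    samples = iter([100.0, 90.0, 85.0])   # pretend power readings (watts)
    caps = []
    run_oda_loop(
        observe=lambda: next(samples),
        decide=lambda watts: 0.8 * watts,
        act=caps.append,
        interval_s=0.0,
        iterations=3,
    )
    print(caps)  # → [80.0, 72.0, 68.0]
```

In the real system the three callbacks map onto DEEP-mon (observe), the HyPPO controller (decide), and the actuator agents (act), as the next slides detail.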
  • 15. Run-time power management 15 Observe: DEEP-mon • per-container power monitor • per-container performance measurements and PMCs • per-request latency measurements. Decide: HyPPO • reactive control • CPU usage & request • power capping. Act: RAPL, CPU quota, thread affinity
  • 17. DEEP-mon at a glance 17 • DEEP-mon is an HT-aware, fine-grained power monitor for container-based environments - precise power attribution to containers - instrumentation free: watches workloads from the outside - lightweight, with little overhead on the target workloads and systems - scalable and distributed, to observe Kubernetes clusters • Monitoring ingredients: container execution → context switch; resource usage → performance counter (cycles)*; power consumption → Intel RAPL. * cycles have a 99% correlation w.r.t. CPU power usage
  • 18. Anatomy of a power monitoring agent 18 user-space kernel-space Intel RAPL DEEP-mon Power attribution Docker and Kubernetes metrics kernel tracing PMC context switch Linux CFS Monitoring back-end
  • 19. Anatomy of a power monitoring agent 19 user-space kernel-space Intel RAPL DEEP-mon Power attribution Docker and Kubernetes metrics PMC Monitoring back-end 200K evts/s kernel tracing context switch Linux CFS
  • 20. Kernel level data acquisition (1) 20 • We cannot send each context switch to user-space - too many events per second to process - too much overhead • Introduce in-kernel data aggregation: eBPF and BCC build, inject and execute code in a kernel VM; trace context switches and count PMCs on the fly; store data in eBPF data structures; send one big event instead of many small ones (DEEP-mon kernel)
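The aggregation idea can be simulated in plain Python. This is only a sketch of the logic: the real DEEP-mon performs it inside eBPF maps populated by BCC-injected kernel code, and the class and method names here are illustrative.

```python
from collections import defaultdict

class SwitchAggregator:
    """Aggregate context-switch events per (pid, cpu) instead of
    forwarding each one, then flush a single summary per interval."""

    def __init__(self):
        self.cycles = defaultdict(int)    # (pid, cpu) -> accumulated cycles
        self.switches = defaultdict(int)  # (pid, cpu) -> context switches seen

    def on_context_switch(self, pid, cpu, cycles_delta):
        # In the real system this path runs in the kernel on every switch,
        # updating an eBPF hash map instead of a Python dict.
        self.cycles[(pid, cpu)] += cycles_delta
        self.switches[(pid, cpu)] += 1

    def flush(self):
        # One big event per interval instead of many small ones.
        snapshot = {k: (self.cycles[k], self.switches[k]) for k in self.cycles}
        self.cycles.clear()
        self.switches.clear()
        return snapshot
```

User-space only pays for one `flush()` per monitoring interval, which is what keeps the per-event overhead off the critical path.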
  • 21. Correlate power and performance 21 • At fixed time intervals we collect the thread map (eBPF output: Thread1, Thread2, Thread3) • Then we extract the power measurement from RAPL and attribute it to each thread. A thread's weighted cycle count sums the cycles of execution periods in which the thread ran alone on its physical core, plus the cycles of periods in which it was co-running on the same physical core via HT, weighted by the HT ratio and divided by 2 to split the overlapping cycles between the two threads; an execution period is the time between context switches on the physical core. (NPB experiments pinning two threads on the same physical core of a Dell PowerEdge with an E5-2680 v2, 10 cores, under Ubuntu Linux showed a co-running power ratio of ≈1.15.) The power attributed to thread T1 is then

P_T1(t) = Σ_{s=0}^{|S|} RAPL_core(t, s) · ( Cycles^W_T1(t, s) / Σ_{k=0}^{|K|} Cycles^W_Tk(t, s) )   (2)

where |K| is the number of threads running in the server in the given period and |S| the number of sockets: the power of thread 1 is the sum, over all sockets, of the RAPL measurement of the socket times the thread's weight inside the socket power consumption. • Finally we group each thread by container (DEEP-mon kernel)
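Equation (2) maps directly onto a few lines of code. The sketch below is illustrative: `rapl_core` and the weighted-cycle counts stand in for the values DEEP-mon reads from RAPL and the PMCs, and the function name is an assumption.

```python
def attribute_power(rapl_core, weighted_cycles):
    """Split each socket's RAPL core power among threads, proportionally
    to the weighted cycles each thread spent on that socket (Eq. 2).

    rapl_core:       {socket: watts measured over the interval}
    weighted_cycles: {thread: {socket: HT-weighted cycle count}}
    Returns {thread: watts attributed to that thread}.
    """
    # Denominator of Eq. (2): total weighted cycles per socket.
    totals = {s: 0 for s in rapl_core}
    for per_socket in weighted_cycles.values():
        for s, cyc in per_socket.items():
            totals[s] += cyc

    # Numerator: each thread's share of each socket's power.
    power = {}
    for t, per_socket in weighted_cycles.items():
        power[t] = sum(
            rapl_core[s] * cyc / totals[s]
            for s, cyc in per_socket.items() if totals[s] > 0
        )
    return power
```

By construction the per-thread attributions sum back to the RAPL measurement, so no power is lost or double-counted; grouping threads by container is then a simple reduction over this dictionary.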
  • 22. Monitoring containers at scale 22 • Once power data is collected, we can send it to a back-end on a regular basis - further aggregation of metrics data - Kubernetes cluster level view • The backend exposes data for visualization and autonomic power management
  • 23. Experimental results 23 Evaluation goals: monitoring should introduce minimal overhead; we evaluated DEEP-mon w.r.t. its overhead on applications and the target system. Cloud benchmarks: Phoronix Test Suite (pts/apache, pts/nginx, pts/fio) — network and syscall intensive. HPC benchmarks: NAS Parallel Benchmarks (EP, MG, CG) — CPU and memory intensive tasks. Results: cloud benchmarks app overhead < 3.3%, HPC benchmarks app overhead < 4%; cloud benchmarks power overhead 1.74% avg, HPC benchmarks power overhead 0.90% avg
  • 25. Run-time power management 25 Observe: DEEP-mon • per-container power monitor • per-container performance measurements and PMCs • per-request latency measurements. Decide: HyPPO • reactive control • CPU usage & request • power capping. Act: RAPL, CPU quota, thread affinity
  • 26. HyPPO 26 • HyPPO is a Hybrid Performance-aware Power-capping Orchestrator for Kubernetes environments - leverages run-time monitoring data coming from DEEP-mon - guarantees SLAs for each container - autonomic management of SLAs and power consumption - hybrid: HW power capping, SW resource management. Energy proportionality: the resources I use == the energy bill I pay. Performance first: guaranteed user experience while saving power
  • 27. HyPPO at a glance 27 • What if we try to slow down some components? - reducing performance means reducing power usage (most of the time) - do it only without affecting end-user experience (and SLAs) - give performance back when needed • Operative steps - measure CPU usage and power consumption in real-time - reason on new possible allocations - act based on previous decisions • Technical challenges - how to operate on a distributed system - workload instrumentation and power attribution - autonomicity in the decision process is key: keep an eye on goals and requirements - how to push performance and power usage up & down (O-D-A)
  • 28. Distributed ODA loop 28 Master Node Node API API Pod Pod API Pod Pod DEEP-mon agent DEEP-mon agent Monitor Backend ACTUATOR AGENT ACTUATOR AGENT HyPPO
  • 29. HyPPO controller 29 Samples of metrics and Kubernetes status arrive from each monitoring agent at the gRPC collector; monitoring samples are unpacked by the metrics workers, aggregated, and stored in an InfluxDB database, which is queried by the monitoring frontend and by the control-loop components. For each host we compute the power cap to be enforced depending on the running containers:

power_n = P_idle + Σ_{c=0}^{C} ( power_{n,c} + i(c) )   (1)

where power_n is the power of node n, P_idle the power of the idle system, and power_{n,c} the power consumed by the c-th container running on the n-th node. The contribution i(c) adjusts the cap under the container's under/over-utilisation condition and can be positive or negative:

i(c) = (cpu_usage_c − cpu_request_c) × P   if cpu_usage_c exists, 0 otherwise   (2)

where cpu_request_c is the CPU request expressed for the c-th container, cpu_usage_c its actual CPU consumption, and P a proportional factor defined in the controller configuration. P was chosen after several experiments as 10000 mW (10 W) and represents the pace at which the controller tries to fill the opportunity gap. Each container CPU-usage data point passes through the controller of Equations (1) and (2), contributing positively or negatively to the power cap of the node on which the container is actually running. Power is then adjusted depending on how the CPU is used.
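Equations (1) and (2) can be put together as follows. This is a sketch with illustrative names; the idle power value and the container dictionaries in the usage below are made-up inputs, while the 10 W proportional factor follows the controller configuration described above.

```python
def container_adjustment(cpu_usage, cpu_request, p_factor=10.0):
    """Eq. (2): positive when a container uses more CPU than requested,
    negative when it uses less; zero if no usage sample exists.
    CPU values are in cores, p_factor and the result in watts."""
    if cpu_usage is None:
        return 0.0
    return (cpu_usage - cpu_request) * p_factor

def node_power_cap(p_idle, containers, p_factor=10.0):
    """Eq. (1): idle power plus each container's measured power,
    corrected by its under/over-utilisation contribution i(c)."""
    cap = p_idle
    for c in containers:
        cap += c["power"] + container_adjustment(
            c.get("cpu_usage"), c["cpu_request"], p_factor)
    return cap

# Example: a 50 W idle node with one container drawing 20 W that uses
# 4 cores out of a 5-core request -> i(c) = (4-5)*10 = -10 W,
# so the cap tightens to 50 + 20 - 10 = 60 W.
```

A container running under its request shrinks the node's cap (reclaiming the opportunity gap), while one running over it relaxes the cap so performance is not throttled.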
  • 30. Actuation 30 • For each host we have a power budget • We leverage Intel RAPL to enforce the power capping on the system • We translate the power budget into power units • Finally we write the power cap into the power-limit field of the RAPL MSR (Actuator agent: RAPL MSR — power unit, power limit; power budget of node i)
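The budget-to-MSR translation amounts to a unit conversion. Per Intel's documentation, RAPL power units are 1/2^PU watts, where PU is the low nibble of MSR_RAPL_POWER_UNIT; the sketch below covers only this conversion (function names are ours), since actually writing the power-limit MSR requires privileged access and is omitted.

```python
def rapl_power_unit(power_unit_msr):
    """Decode the power-unit field of MSR_RAPL_POWER_UNIT:
    units of 1 / 2^PU watts, PU being bits 3:0 of the register."""
    return 1.0 / (1 << (power_unit_msr & 0xF))

def watts_to_limit_field(budget_watts, power_unit_msr):
    """Translate a power budget in watts into RAPL power units,
    i.e. the integer value for the power-limit field of the MSR."""
    return round(budget_watts / rapl_power_unit(power_unit_msr))
```

With the common PU = 3 (1/8 W units), a 50 W budget becomes a limit-field value of 400.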
  • 31. Experimental setup 31 Testbed: Kubernetes cluster composed of 2 homogeneous nodes. Node specs: Dell PowerEdge r720xd, 2x Intel Xeon E5-2680 Ivy Bridge (20 HT), 2.80GHz, 380GB of RAM. Benchmarks: Phoronix Test Suite, CPU request = 5 cores (5000 millicpus). (Plot: Apache CPU usage vs. CPU request over execution time — the opportunity gap.) Goal: HyPPO should be able to guarantee performance and at the same time try to reduce power usage
  • 32. Experimental results (1): Apache 32 HyPPO provides good performance, with a 5% SLA violation rate. (Plots: Apache CPU usage, controlled vs. uncontrolled, against the CPU request; Apache power consumed.)
  • 33. Experimental results (2): Fio 33 Power saving, but higher benchmark execution time. (Plots: FIO CPU usage; FIO power consumption.)
  • 34. Experimental results (3): Nginx 34 Good power saving results, with few effects on NGINX CPU usage (few threads). (Plots: NGINX CPU usage; NGINX power consumption.)
  • 35. Experimental results (3): Nginx 35 Preliminary tests showed a power saving ranging from 5% to 45% and SLA violations of 2.5% on average. HyPPO currently works well with multithreaded workloads. (Plots as in the previous slide.)
  • 36. 36 Sometimes we cannot slow down applications. Heterogeneity helps us to meet SLAs and improve energy efficiency
  • 37. 37 Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, et al. An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 3–18, 2019. Batches of DeathStars
  • 43. BlastFunction in a nutshell 43 FPGA sharing system for microservices and serverless functions. Reconfiguration-aware and accelerator-independent. Transparent integration and scalability. Fully integrated with an existing cloud orchestrator (Kubernetes)
  • 45. Remote OpenCL library 45 Connects to the Registry/Device Manager to access the remote device. Synchronous + asynchronous actions. Supports multiple kernels, accelerators and concurrent queues. Transparent OpenCL wrapper; easy integration (dynamic library w/ OpenCL loader)
  • 46. Device Manager 46 Controls a single underlying FPGA device. Exposes gRPC and shared memory interfaces to perform actions through the device. Reconfiguration-aware. Groups tasks (read/exec/write) based on the source function. (Diagram: apps → Device Manager execution thread → single task: buffer write, kernel run, buffer read → FPGA)
  • 47. Accelerators Registry 47 Master component of the system. Contains all the information about functions and devices. Performs allocation before and after function deployment. Integrated with the orchestrator through hooks and APIs. Allocation based on runtime metrics of the system
  • 48. Experimental evaluation 48 Setup: • Master node: Intel Xeon W3530, 24GB RAM • 2 worker nodes: Intel i7-6700, 32GB RAM • Terasic DE5a-Net FPGA on all nodes • Measured overhead and performance behaviour on single / multi-app. Test applications/kernels: Sobel filter, matrix multiplication, CNNs
  • 49. Overhead results - I/O latencies 49 • Overhead given by memory copy operations • 3 data copies for BlastFunction, 1 data copy for BlastFunction shm • ~47% latency slowdown w.r.t. native execution (for I/O-only operations) • One data copy to guarantee OpenCL transparency
  • 50. Overhead results 50 Sobel: • BlastFunction overhead starts from 2.46 ms and reaches 24 ms w.r.t. native • BlastFunction shm keeps a 2 ms overhead (24.04%) for all executions w.r.t. native, from images of 10x10 pixels to 1920x1080 pixels • Shared memory provides limited and acceptable overhead. Matrix multiplication: • Minimum RTT of 2 ms for both BlastFunction and BlastFunction shm • BlastFunction shm has a maximum overhead of 17 ms (0.27%) w.r.t. native (over 3.571 s of native execution time at the largest input size) • Shared memory provides negligible overhead
  • 51. Multi-application results (sobel accelerator) 51 Throughput 1.40x; % utilization 1.46x; applications 5 vs 3; latency (max) 1.04x. (Plots: BLF vs. native at low/medium/high load.)
  • 52. Multi-application results (matmult accelerator) 52 Throughput 2.15x; % utilization 1.17x; applications 5 vs 3; latency (max) 0.59x. (Plots: BLF vs. native at low/medium/high load.)
  • 53. Multi-application results (CNN accelerator) 53 Throughput 1.26x; % utilization 1.06x; applications 5 vs 3; latency (max) 1.40x. (Plots: BLF vs. native at low/high load.)
  • 54. Conclusion 54 • Cloud-native application monitoring: performance and power observations - black-box power monitoring approach - performance counters, power consumption, CPU % measured w/ negligible overhead • Cloud-native performance-aware power-capping orchestration - distributed power capping for cloud-native applications - 25% average power savings, 5% SLA violation rate, works well w/ multithreaded workloads • Looking forward: accelerated cloud-native workloads - improve performance of cloud-native microservices w/ FPGA-based systems - sharing improves utilization of the FPGAs with acceptable latency degradation
  • 55. Future work 55 Latency. Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, et al. An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 3–18, 2019.
  • 56. Acknowledgements 56 I would like to thank these amazing people for all their work and support towards the realization of this work (and many others): Fabiola Casasopra, Luca Danelux, Luca Malagux, Giorgia Fiscalex, Sara Notargiacomo, Marco Santambrogio, Tommaso Sardelli, Marco Arnaboldi, Marco Bacis, Andrea Strada, Daniele Rossex, Samuele Barbieri