Managing power and performance trade-offs
in distributed cloud-native infrastructures
NECST (Thursday) Friday talk, 04/23/2020
Rolando Brondolin
<rolando.brondolin@polimi.it>
Cloud computing
A cloudy landscape
Cloud services have become more structured and varied over the last few years
Physical Hardware
VM1 VM2
C1 C2 C3 C4
The complexity of the
environment is left
to the Cloud provider
Data centers will account for 8% of worldwide energy consumption by 2030 [1]
[1] Anders SG Andrae and Tomas Edler. On global electricity usage of communication technology: trends to 2030. Challenges, 6(1):117–157, 2015.
The need for power awareness
[2] Beloglazov, A., Buyya, R., Lee, Y. C., Zomaya, A. Y., et al. A taxonomy and survey of energy-efficient data centers and cloud computing systems. Advances in Computers 82 (2011), 47–111.
[1] Cui, Yan, et al. "Total cost of ownership model for data center technology evaluation." Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm), 2017 16th
IEEE Intersociety Conference on. IEEE, 2017.
Power consumption accounts for about 20% of a data center's TCO [1]
Lifetime energy cost will exceed hardware cost in the near future [2]
Energy budgets and power caps constrain the performance of the system
Power consumption is affected by a plethora of different actors
Performance is key for production systems, regardless of power consumption
Tradeoff between power consumption and performance
Applications provide a service to the user:
• performance should be guaranteed
• power saving comes from run-time management
Cloud-native technologies
According to the CNCF[1]:
"Cloud native technologies empower organizations to build and run
scalable applications in modern, dynamic environments such as public,
private, and hybrid clouds.
Containers, service meshes, microservices, immutable infrastructure,
and declarative APIs exemplify this approach."
[1] https://github.com/cncf/toc/blob/master/DEFINITION.md
Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, et al. An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 3–18, 2019.
We said DeathStars…
Power vs performance: challenges
• how we can measure the behavior of cloud-native applications and the environment in terms of performance and power consumption
• how we can define performance throughout cloud-native applications and how we can define meaningful performance targets
• how we can effectively and precisely reduce power consumption while preserving the performance of the running workloads
Run-time power management
Instrumentation-free observation
•Performance monitor
•Power monitor and attribution
Decide and optimize
•Reactive/predictive Power-aware control
•Control policies based on overall goals
Fast actuation
•Enforce policies to save power
•Tune power and resource allocation
Proposed general approach
Observe - Decide - Act loop
Goal: introduce autonomicity to improve energy efficiency and bring energy proportionality to cloud-native applications
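The loop above can be sketched in a few lines of Python. Everything here (function names, the sample format, the proportional policy) is illustrative, not the actual DEEP-mon/HyPPO interface:

```python
# Minimal Observe-Decide-Act loop sketch (illustrative only).

def observe(samples):
    """Observe: average the per-container power samples (watts)."""
    return {c: sum(v) / len(v) for c, v in samples.items()}

def decide(metrics, power_budget_w):
    """Decide: if the node exceeds its budget, scale every container's
    power cap proportionally; otherwise leave containers uncapped."""
    total = sum(metrics.values())
    if total <= power_budget_w:
        return {c: None for c in metrics}
    scale = power_budget_w / total
    return {c: p * scale for c, p in metrics.items()}

def act(caps):
    """Act: a real actuator would program RAPL limits / CPU quotas here."""
    return {c: cap for c, cap in caps.items() if cap is not None}

samples = {"web": [30.0, 34.0], "db": [22.0, 18.0]}  # watts per container
caps = act(decide(observe(samples), power_budget_w=40.0))
```

Running the loop once keeps the node at its 40 W budget by shrinking each container's cap in proportion to its measured draw.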
Run-time power management
RAPL
CPU quota
Thread affinity
DEEP-mon
• per-container power monitor
• per-container performance measurements and PMCs
• per-request latency measurements
HyPPO
• reactive control
• CPU usage & request
• power capping
Act
DEEP-mon at a glance
• DEEP-mon is an HT-aware fine-grained power monitor for container-based environments
- precise power attribution to containers
- instrumentation-free: watches workloads from the outside
- lightweight, with little overhead on the target workloads and systems
- scalable and distributed, to observe Kubernetes clusters
• Monitoring ingredients:
Container
execution
Resource
usage
Power
consumption
Context
switch
Performance
Counter (cycle)*
Intel
RAPL
* cycles show a 99% correlation with CPU power usage
Anatomy of a power monitoring agent
user-space
kernel-space
Intel RAPL
DEEP-mon
Power attribution
Docker and Kubernetes metrics
kernel tracing
PMC
context switch
Linux CFS
Monitoring back-end
kernel tracing handles ≈ 200K context-switch events/s
Kernel-level data acquisition (1)
• We cannot send each context switch to user-space
- too many events per second to process
- too much overhead
• Introduce in-kernel data aggregation:
eBPF and BCC: build, inject and execute code in a kernel VM
trace context switches, count PMCs on the fly
store data in eBPF data structures
send one big event instead of many small ones
DEEP-mon
kernel
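The batching idea behind the in-kernel aggregation can be modeled in plain Python (this is an illustration of the scheme, not actual eBPF/BCC code; in DEEP-mon the map lives inside the kernel VM):

```python
from collections import defaultdict

class SwitchAggregator:
    """Models the in-kernel eBPF map: fold per-thread cycle counts at each
    context switch, then flush one aggregate event per interval instead of
    one event per switch (illustrative only, not real eBPF code)."""

    def __init__(self):
        self.cycles = defaultdict(int)  # tid -> cycles since last flush

    def on_context_switch(self, tid, cycles):
        # In DEEP-mon this update happens inside the kernel VM.
        self.cycles[tid] += cycles

    def flush(self):
        # One "big event" sent to user-space, emptying the map.
        event, self.cycles = dict(self.cycles), defaultdict(int)
        return event
```

Thousands of per-switch samples collapse into a single dictionary per monitoring interval, which is what keeps the user-space side of the monitor cheap.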
Correlate power and performance
• At fixed time intervals we collect the thread map
• Then we extract the power measurement from RAPL and attribute it to each thread:
eBPF output
Thread1
Thread2
Thread3
thread map
[Paper excerpt: HT experiments with NPB benchmarks on a Dell PowerEdge (Intel Xeon E5-2680 v2, 10 cores, Ubuntu Linux) pin two threads to the same physical core; the measured power consumption ratio is ≈ 1.15 w.r.t. a single thread.]

…the execution periods in which the thread was co-running on the same physical core via HT, weighted by the HTr ratio and divided by 2 to equally divide the overlapping cycles among the two threads. In this context an execution period is defined as the time between context switches on the physical core where the thread is scheduled.

Starting from Equation (1), we can now attribute the power measured by RAPL for our thread T1 following Equation (2), where |K| is the cardinality of the set K of threads running in the server in a given period of time and |S| is the cardinality of the set S of sockets in the system.

P_T1(t) = Σ_{s=0}^{|S|} [ RAPL_core(t, s) · Cycles_TW1(t, s) / Σ_{k=0}^{|K|} Cycles_TWk(t, s) ]   (2)
Power of Thread 1
Sum among all sockets
RAPL measurement of the socket
Thread weight inside
the socket power consumption
• Finally we group each thread by container
DEEP-mon
kernel
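Equation (2) can be transcribed almost literally; the data layout below (per-socket RAPL readings, per-thread weighted cycle counts keyed by `(thread, socket)`) is a hypothetical convenience, not DEEP-mon's actual structures:

```python
def attribute_power(rapl_core, cycles, thread):
    """Equation (2): split each socket's RAPL core power among threads in
    proportion to the (HT-weighted) cycles each thread retired there.
    rapl_core: {socket: watts}; cycles: {(thread, socket): weighted cycles}."""
    power = 0.0
    for s, socket_power in rapl_core.items():
        # Denominator: total weighted cycles retired on socket s.
        total = sum(c for (k, sk), c in cycles.items() if sk == s)
        if total > 0:
            power += socket_power * cycles.get((thread, s), 0) / total
    return power
```

By construction the attributions of all threads on a socket sum to the socket's RAPL reading, so no power is lost or double counted.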
Monitoring containers at scale
• Once power data is collected, we can send it to a back-end on a regular basis
- further aggregation of metrics data
- Kubernetes cluster-level view
• The backend exposes data for visualization and autonomic power management
Benchmarks
Cloud Benchmarks: Phoronix test suite pts/apache, pts/Nginx, pts/fio
HPC Benchmarks: NAS Parallel benchmarks EP, MG, CG
Experimental results
Network and syscall intensive benchmarks CPU and memory intensive tasks
Cloud benchmarks
app overhead
< 3.3%
HPC benchmarks
app overhead
< 4%
Cloud benchmarks
power overhead
1.74% avg
HPC benchmarks
power overhead
0.90% avg
Evaluation goals
Monitoring should introduce minimal overhead
We evaluated DEEP-mon w.r.t. its overhead on applications and the target system
Much data!
HyPPO
• HyPPO is a Hybrid Performance-aware Power-capping Orchestrator for Kubernetes environments
- leverages run-time monitoring data coming from DEEP-mon
- guarantees SLAs for each container
- autonomic management of SLAs and power consumption
- hybrid: HW power capping, SW resource management
Energy proportionality
The resources I use == the energy bill I pay
Performance first
Guaranteed user experience, saving power
HyPPO at a glance
• What if we try to slow down some components?
- reducing performance means reducing power usage (most of the time)
- do it only without affecting the end-user experience (and SLAs)
- give performance back when needed
• Operative steps
- measure CPU usage and power consumption in real time
- reason about new possible allocations
- act based on previous decisions
• Technical challenges
- how to operate on a distributed system
- workload instrumentation and power attribution
- autonomicity in the decision process is key; keep an eye on goals and requirements
- how to push performance and power usage up and down
Distributed ODA loop
Master
Node Node
API
API
Pod Pod
API
Pod Pod
DEEP-mon agent DEEP-mon agent
Monitor
Backend
ACTUATOR
AGENT
ACTUATOR
AGENT
HyPPO
HyPPO controller
[Figure: the HyPPO controller receives samples of metrics and Kubernetes status from each monitoring agent in the gRPC collector; samples are unpacked by the Metrics workers, aggregated, and stored in an InfluxDB database, which is queried by the monitoring frontend (real-time view) and by the control-loop components.]

power_n = P_idle + Σ_{c=0}^{C} (power_{n,c} + i(c))   (1)

Equation (1) defines the power for the n-th node as the sum of the idle power P_idle and the powers consumed by each container running on the node, where power_{n,c} is the power consumed by the c-th container on the n-th node, plus a contribution i(c). The contribution can be positive or negative and is expressed in Equation (2), where cpu_request_c represents the CPU request expressed for the c-th container, cpu_usage_c represents its actual CPU consumption, and P is a proportional factor that can be defined in the controller configuration. Each container CPU usage data point passes through the controller represented by Equations (1) and (2); in this way, it contributes positively or negatively to the total power consumption of the node on which the container is actually running.

i(c) = (cpu_usage_c - cpu_request_c) · P   if cpu_usage_c exists
i(c) = 0                                   otherwise        (2)

The P parameter was chosen after several experiments and represents the pace at which the controller tries to fill the opportunity gap. It is defined in the configuration file of the controller as 10000 mW (10 W).
For each host we compute the power cap to be enforced depending on running containers
CONTROLLER
power of node n
power of idle system power of the container
power adjusting
under/over utilisation condition proportional factor (10W)
Power is then adjusted depending on how the CPU is used
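Equations (1) and (2) translate into a few lines of Python. The container records and field names below are hypothetical; powers are in watts, CPU in cores, and P = 10 W as in the controller configuration:

```python
def contribution(cpu_usage, cpu_request, p_gain=10.0):
    """Equation (2): i(c) pushes power up when the container uses more CPU
    than requested and down when it uses less; 0 if no usage sample exists."""
    if cpu_usage is None:
        return 0.0
    return (cpu_usage - cpu_request) * p_gain

def node_power_cap(p_idle, containers, p_gain=10.0):
    """Equation (1): node power = idle power + per-container measured power
    adjusted by i(c). The result becomes the cap enforced on the node."""
    return p_idle + sum(
        c["power"] + contribution(c.get("usage"), c["request"], p_gain)
        for c in containers
    )

# A container using 4 of its 5 requested cores lowers the cap by 10 W:
containers = [
    {"power": 30.0, "usage": 4.0, "request": 5.0},
    {"power": 20.0, "request": 5.0},  # no usage sample yet -> i(c) = 0
]
cap = node_power_cap(p_idle=50.0, containers=containers)
```

A container under its request shrinks the node cap (reclaiming the opportunity gap), while one over its request grows it, giving performance back.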
Actuation
ACTUATOR
AGENT
RAPL MSR
Power unit Power limit
Power budget node I
• For each host we have a power budget
• We leverage Intel RAPL to enforce the
power capping on the system
• Then we translate the power budget
into power units
• Finally we write the power cap into the power-limit field of the RAPL MSR
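The budget-to-MSR translation can be sketched as below. The bit layout follows Intel's documented RAPL interface (power-unit field in bits 3:0 of MSR_RAPL_POWER_UNIT; 15-bit limit, enable bit 15 and clamp bit 16 in MSR_PKG_POWER_LIMIT); the privileged MSR write itself and the time-window bits are left out of this sketch:

```python
def watts_to_rapl_units(watts, power_unit_field):
    """One RAPL power unit is (1/2)^power_unit_field watts, as read from
    bits 3:0 of MSR_RAPL_POWER_UNIT (field 3 -> 0.125 W per unit)."""
    return round(watts * (1 << power_unit_field))

def pkg_power_limit(watts, power_unit_field):
    """Compose the low bits of MSR_PKG_POWER_LIMIT: 15-bit power limit in
    RAPL units, plus the enable (bit 15) and clamp (bit 16) flags."""
    limit = watts_to_rapl_units(watts, power_unit_field) & 0x7FFF
    return limit | (1 << 15) | (1 << 16)
```

With the common unit field of 3, a 40 W budget becomes 320 raw units before the enable/clamp flags are OR-ed in.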
Experimental setup
Testbed
Kubernetes cluster composed of 2 homogeneous nodes
Node specs: Dell PowerEdge r720xd, 2x Intel Xeon E5-2680 Ivy Bridge (20 HT), 2.80GHz, 380GB of RAM
Benchmarks: Phoronix Test Suite, CPU request = 5 cores (5000 millicpus)
[Chart: Apache CPU Opportunity Gap - CPU% over execution time, apache-cpu vs CPU Request]
Goal
HyPPO should be able to guarantee performance
and at the same time try to reduce power usage
Experimental results (1): Apache
HyPPO provides good performance, with a 5% SLA violation rate
[Chart: Apache CPU usage - apache-cpu vs apache-cpu-ctrl vs CPU Request]
[Chart: Apache power consumed - apache-pw vs apache-pw-ctrl]
Experimental results (2): Fio
[Chart: FIO CPU usage - fio-cpu vs fio-cpu-ctrl vs CPU Request]
Power saving, but higher benchmark execution time
[Chart: FIO power consumption - fio-pw vs fio-pw-ctrl]
Experimental results (3): Nginx
[Chart: NGINX CPU usage - nginx-cpu vs nginx-cpu-ctrl vs CPU Request]
Good power-saving results, with few effects on NGINX CPU usage (few threads)
[Chart: NGINX power consumption - nginx-pw vs nginx-pw-ctrl]
Preliminary tests showed power savings ranging from 5% to 45% and SLA violations of 2.5% on average
HyPPO currently works well with multithreaded workloads
Sometimes we cannot slow down applications
Heterogeneity helps us to meet SLAs and improve energy efficiency
Batches of DeathStars
Problem definition
FPGAVM
App
Runtime
req/s
time
Our approach: BlastFunction
FPGA
App
FPGA
App
FPGA
App
BlastFunction in a nutshell
FPGA Sharing system for Microservices and Serverless functions
Reconfiguration-aware and Accelerator-independent
Transparent integration and scalability
Fully integrated with existing cloud orchestrator (Kubernetes)
Containers / Functions
Orchestrator
Accelerator
Registry
Node
FPGA
Device
Manager
FPGA
Device
Manager
High level architecture
Node
FPGA
Device
Manager
App
Remote OpenCL library
Connects to the Registry/Device Manager to access the remote device
Synchronous + asynchronous actions
Supports multiple kernels, accelerators and concurrent queues
Transparent OpenCL wrapper
Easy integration (dynamic library w/ OpenCL loader)
App
Device Manager
Controls a single underlying FPGA device
Exposes gRPC and shared memory interfaces
to perform actions through the device
Reconfiguration-aware
Group tasks (read/exec/write) based on source
function
Device
Manager
App
Ex. Thread
DeviceManager
Single Task
Buffer Write
Kernel Run
Buffer Read
endpoint
AppApp
FPGA
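The grouping policy above can be sketched as a simplified scheduler (the real Device Manager also handles fairness and reconfiguration logic; this only models the batching rule):

```python
def group_by_function(tasks):
    """Batch consecutive tasks coming from the same source function so the
    write/run/read triple executes back to back and the shared FPGA is not
    reconfigured mid-request (simplified model of the Device Manager)."""
    batches = []
    for fn, op in tasks:  # op is one of "write", "run", "read"
        if batches and batches[-1][0] == fn:
            batches[-1][1].append(op)   # same function: extend its batch
        else:
            batches.append((fn, [op]))  # new function: open a new batch
    return batches
```

Interleaved requests from different functions thus become per-function batches, and a bitstream swap is only needed at batch boundaries.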
Accelerators Registry
Master component of the system
Contains all the information about functions and devices
Performs allocation before and after function deployment
Integrated with the orchestrator through hooks and APIs
Allocation based on runtime metrics of the system
Accelerator
Registry
Experimental evaluation
• Master node: Intel Xeon-W3530, 24GB RAM
• 2 Worker nodes: Intel i7-6700, 32GB RAM
• Terasic DE5a-Net FPGA on all nodes
• Overhead, perf. behaviour on single / multi-app
Sobel Filter Matrix Multiplication
Test Applications/Kernels
CNNs
Setup
Overhead results - I/O latencies
• Overhead given by memory copy operations
• 3 data copies for BlastFunction, 1 data copy for BlastFunction shm
• ~47% latency slowdown w.r.t. native execution (for I/O-only operations)
• One data copy to guarantee OpenCL transparency
Overhead results
Sobel Matrix Multiplication
• BlastFunction overhead ranges from 2.46 ms up to 24 ms w.r.t. native
• BlastFunction shm keeps a ~2 ms (24.04%) overhead w.r.t. native across all executions, from images of 10x10 pixels to 1920x1080 pixels
• Shared memory provides limited and acceptable overhead
• Minimum RTT of 2ms for both BlastFunction and BlastFunction shm
• BlastFunction shm has a maximum overhead of 17ms (0.27%) w.r.t.
native (over 3.571s of Native execution time at the largest input size)
• Shared memory provides negligible overhead
Multi-application results (Sobel accelerator)
Throughput
1.40x
% Utilization
1.46x
Applications
5 vs 3
BLF, Low load
BLF, Medium load
BLF, High load
Native, Low load
Native, Medium load
Native, High load
Latency (max)
1.04x
Multi-application results (matmult accelerator)
Throughput
2.15x
% Utilization
1.17x
Applications
5 vs 3
BLF, Low load
BLF, Medium load
BLF, High load
Native, Low load
Native, Medium load
Native, High load
Latency (max)
0.59x
Multi-application results (CNN accelerator)
Throughput
1.26x
% Utilization
1.06x
Applications
5 vs 3
BLF, Low load
BLF, High load
Native, Low load
Native, High load
Latency (max)
1.40x
Conclusion
• Cloud-native application monitoring: performance and power observations
- Black-box power monitoring approach
- Performance counters, power consumption, CPU % measured w/ negligible overhead
• Cloud-native performance-aware power-capping orchestration
- Distributed power capping for cloud-native applications
- 25% average power savings, 5% SLA violation rate; works well w/ multithreaded workloads
• Looking forward: accelerated cloud-native workloads
- Improve performance of cloud-native microservices w/ FPGA-based systems
- Sharing improves utilization of the FPGAs with acceptable latency degradation
Future work
[Figure: latency of microservice benchmarks, from the DeathStarBench suite (Gan et al., ASPLOS 2019)]
Acknowledgements
I would like to thank these amazing people for all their work and support towards the realization of this work (and many others)
Fabiola Casasopra
Luca Danelutti
Luca Malagutti
Giorgia Fiscaletti
Sara Notargiacomo
Marco Santambrogio
Tommaso Sardelli
Marco Arnaboldi
Marco Bacis
Andrea Strada
Daniele Rossetti
Samuele Barbieri

More Related Content

PDF
Run-time power management in cloud and containerized environments
PDF
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
PDF
CoolDC'16: Seeing into a Public Cloud: Monitoring the Massachusetts Open Cloud
PDF
COMPARISON OF ENERGY OPTIMIZATION CLUSTERING ALGORITHMS IN WIRELESS SENSOR NE...
PDF
IRJET- An Energy-Saving Task Scheduling Strategy based on Vacation Queuing & ...
PDF
Intelligent Workload Management in Virtualized Cloud Environment
PDF
Intelligent Placement of Datacenter for Internet Services
PPTX
Augmenting Amdahl's Second Law for Cost-Effective and Balanced HPC Infrastruc...
Run-time power management in cloud and containerized environments
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
CoolDC'16: Seeing into a Public Cloud: Monitoring the Massachusetts Open Cloud
COMPARISON OF ENERGY OPTIMIZATION CLUSTERING ALGORITHMS IN WIRELESS SENSOR NE...
IRJET- An Energy-Saving Task Scheduling Strategy based on Vacation Queuing & ...
Intelligent Workload Management in Virtualized Cloud Environment
Intelligent Placement of Datacenter for Internet Services
Augmenting Amdahl's Second Law for Cost-Effective and Balanced HPC Infrastruc...

What's hot (20)

PDF
IRJET- Optimization with PSO and FPO based Control for Energy Efficient of Se...
PDF
Detecting Lateral Movement with a Compute-Intense Graph Kernel
PDF
9.distributive energy efficient adaptive clustering protocol for wireless sen...
PDF
A Scalable and Distributed Electrical Power Monitoring System Utilizing Cloud...
PDF
Optimization of energy consumption in cloud computing datacenters
PDF
IRJET- An Efficient Dynamic Deputy Cluster Head Selection Method for Wireless...
PDF
Applying Cloud Techniques to Address Complexity in HPC System Integrations
PDF
Survey: An Optimized Energy Consumption of Resources in Cloud Data Centers
PDF
IRJET- Sink Mobility based Energy Efficient Routing Protocol for Wireless Sen...
PDF
Quality of Service based Task Scheduling Algorithms in Cloud Computing
PDF
Load Balancing in Cloud Computing Through Virtual Machine Placement
PDF
Energy-Efficient Hybrid K-Means Algorithm for Clustered Wireless Sensor Netw...
PPTX
PDF
Hybrid Task Scheduling Approach using Gravitational and ACO Search Algorithm
PDF
REGION BASED DATA CENTRE RESOURCE ANALYSIS FOR BUSINESSES
PDF
PDF
Particle Swarm Optimization (PSO)-Based Distributed Power Control Algorithm f...
PDF
G03202048050
PDF
Energy Efficient Change Management in a Cloud Computing Environment
PDF
A Novel Cluster-Based Energy Efficient Routing With Hybrid Protocol in Wirele...
IRJET- Optimization with PSO and FPO based Control for Energy Efficient of Se...
Detecting Lateral Movement with a Compute-Intense Graph Kernel
9.distributive energy efficient adaptive clustering protocol for wireless sen...
A Scalable and Distributed Electrical Power Monitoring System Utilizing Cloud...
Optimization of energy consumption in cloud computing datacenters
IRJET- An Efficient Dynamic Deputy Cluster Head Selection Method for Wireless...
Applying Cloud Techniques to Address Complexity in HPC System Integrations
Survey: An Optimized Energy Consumption of Resources in Cloud Data Centers
IRJET- Sink Mobility based Energy Efficient Routing Protocol for Wireless Sen...
Quality of Service based Task Scheduling Algorithms in Cloud Computing
Load Balancing in Cloud Computing Through Virtual Machine Placement
Energy-Efficient Hybrid K-Means Algorithm for Clustered Wireless Sensor Netw...
Hybrid Task Scheduling Approach using Gravitational and ACO Search Algorithm
REGION BASED DATA CENTRE RESOURCE ANALYSIS FOR BUSINESSES
Particle Swarm Optimization (PSO)-Based Distributed Power Control Algorithm f...
G03202048050
Energy Efficient Change Management in a Cloud Computing Environment
A Novel Cluster-Based Energy Efficient Routing With Hybrid Protocol in Wirele...
Ad

Similar to HYPPO - NECSTTechTalk 23/04/2020 (20)

PDF
DEEP-mon: Dynamic and Energy Efficient Power monitoring for container-based i...
PDF
Servant-ModLeach Energy Efficient Cluster Base Routing Protocol for Large Sca...
PDF
IRJET- An Enhanced Cluster (CH-LEACH) based Routing Scheme for Wireless Senso...
PDF
[EWiLi2016] Towards a performance-aware power capping orchestrator for the Xe...
PDF
DEEP-mon: Dynamic and Energy Efficient Power monitoring for container-based i...
PDF
A Brief Survey of Current Power Limiting Strategies
PDF
Keep Calm and React with Foresight: Strategies for Low-Latency and Energy-Eff...
PPTX
Power Comparison Power Comparison of Cloud Data of Cloud Data Center Architec...
PDF
Residual Energy Based Cluster head Selection in WSNs for IoT Application
PDF
Energy aware load balancing and application scaling for the cloud ecosystem
PDF
[EUC2016] DockerCap: a software-level power capping orchestrator for Docker c...
PDF
System on Chip Based RTC in Power Electronics
PDF
Performance and Energy evaluation
PDF
Optical Switching in the Datacenter
PDF
Managing Grid Constraints with Active Management Systems
PDF
Energy-aware Load Balancing and Application Scaling for the Cloud Ecosystem
PDF
A NEW DATA ENCODER AND DECODER SCHEME FOR NETWORK ON CHIP
PDF
IRJET- Load Frequency Control of a Renewable Source Integrated Four Area ...
PDF
IRJET- DOE to Minimize the Energy Consumption of RPL Routing Protocol in IoT ...
PDF
ENERGY CONSUMPTION IMPROVEMENT OF TRADITIONAL CLUSTERING METHOD IN WIRELESS S...
DEEP-mon: Dynamic and Energy Efficient Power monitoring for container-based i...
Servant-ModLeach Energy Efficient Cluster Base Routing Protocol for Large Sca...
IRJET- An Enhanced Cluster (CH-LEACH) based Routing Scheme for Wireless Senso...
[EWiLi2016] Towards a performance-aware power capping orchestrator for the Xe...
DEEP-mon: Dynamic and Energy Efficient Power monitoring for container-based i...
A Brief Survey of Current Power Limiting Strategies
Keep Calm and React with Foresight: Strategies for Low-Latency and Energy-Eff...
Power Comparison Power Comparison of Cloud Data of Cloud Data Center Architec...
Residual Energy Based Cluster head Selection in WSNs for IoT Application
Energy aware load balancing and application scaling for the cloud ecosystem
[EUC2016] DockerCap: a software-level power capping orchestrator for Docker c...
System on Chip Based RTC in Power Electronics
Performance and Energy evaluation
Optical Switching in the Datacenter
Managing Grid Constraints with Active Management Systems
Energy-aware Load Balancing and Application Scaling for the Cloud Ecosystem
A NEW DATA ENCODER AND DECODER SCHEME FOR NETWORK ON CHIP
IRJET- Load Frequency Control of a Renewable Source Integrated Four Area ...
IRJET- DOE to Minimize the Energy Consumption of RPL Routing Protocol in IoT ...
ENERGY CONSUMPTION IMPROVEMENT OF TRADITIONAL CLUSTERING METHOD IN WIRELESS S...
Ad

More from NECST Lab @ Politecnico di Milano (20)

PDF
Mesticheria Team - WiiReflex
PPTX
Punto e virgola Team - Stressometro
PDF
BitIt Team - Stay.straight
PDF
BabYodini Team - Talking Gloves
PDF
printf("Nome Squadra"); Team - NeoTon
PPTX
BlackBoard Team - Motion Tracking Platform
PDF
#include<brain.h> Team - HomeBeatHome
PDF
Flipflops Team - Wave U
PDF
Bug(atta) Team - Little Brother
PDF
#NECSTCamp: come partecipare
PDF
NECSTLab101 2020.2021
PDF
TreeHouse, nourish your community
PDF
TiReX: Tiled Regular eXpressionsmatching architecture
PDF
Embedding based knowledge graph link prediction for drug repurposing
PDF
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
PDF
EMPhASIS - An EMbedded Public Attention Stress Identification System
PDF
Luns - Automatic lungs segmentation through neural network
PDF
BlastFunction: How to combine Serverless and FPGAs
PDF
Maeve - Fast genome analysis leveraging exact string matching
Mesticheria Team - WiiReflex
Punto e virgola Team - Stressometro
BitIt Team - Stay.straight
BabYodini Team - Talking Gloves
printf("Nome Squadra"); Team - NeoTon
BlackBoard Team - Motion Tracking Platform
#include<brain.h> Team - HomeBeatHome
Flipflops Team - Wave U
Bug(atta) Team - Little Brother
#NECSTCamp: come partecipare
NECSTLab101 2020.2021
TreeHouse, nourish your community
TiReX: Tiled Regular eXpressionsmatching architecture
Embedding based knowledge graph link prediction for drug repurposing
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
EMPhASIS - An EMbedded Public Attention Stress Identification System
Luns - Automatic lungs segmentation through neural network
BlastFunction: How to combine Serverless and FPGAs
Maeve - Fast genome analysis leveraging exact string matching

Recently uploaded (20)

PPTX
Module1.pptxrjkeieuekwkwoowkemehehehrjrjrj
PDF
Lesson 3 .pdf
PDF
VSL-Strand-Post-tensioning-Systems-Technical-Catalogue_2019-01.pdf
PDF
Beginners-Guide-to-Artificial-Intelligence.pdf
PDF
VTU IOT LAB MANUAL (BCS701) Computer science and Engineering
PPTX
Quality engineering part 1 for engineering undergraduates
PPT
Programmable Logic Controller PLC and Industrial Automation
PPTX
Environmental studies, Moudle 3-Environmental Pollution.pptx
PDF
Mechanics of materials week 2 rajeshwari
PDF
IAE-V2500 Engine Airbus Family A319/320
PPTX
Wireless sensor networks (WSN) SRM unit 2
PPTX
Solar energy pdf of gitam songa hemant k
PDF
Project_Mgmt_Institute_-Marc Marc Marc .pdf
PDF
Principles of operation, construction, theory, advantages and disadvantages, ...
DOCX
An investigation of the use of recycled crumb rubber as a partial replacement...
PPTX
DATA STRCUTURE LABORATORY -BCSL305(PRG1)
PDF
MLpara ingenieira CIVIL, meca Y AMBIENTAL
PPTX
BBOC407 BIOLOGY FOR ENGINEERS (CS) - MODULE 1 PART 1.pptx
PPTX
MAD Unit - 3 User Interface and Data Management (Diploma IT)
PDF
IAE-V2500 Engine for Airbus Family 319/320
Module1.pptxrjkeieuekwkwoowkemehehehrjrjrj
Lesson 3 .pdf
VSL-Strand-Post-tensioning-Systems-Technical-Catalogue_2019-01.pdf
Beginners-Guide-to-Artificial-Intelligence.pdf
VTU IOT LAB MANUAL (BCS701) Computer science and Engineering
Quality engineering part 1 for engineering undergraduates
Programmable Logic Controller PLC and Industrial Automation
Environmental studies, Moudle 3-Environmental Pollution.pptx
Mechanics of materials week 2 rajeshwari
IAE-V2500 Engine Airbus Family A319/320
Wireless sensor networks (WSN) SRM unit 2
Solar energy pdf of gitam songa hemant k
Project_Mgmt_Institute_-Marc Marc Marc .pdf
Principles of operation, construction, theory, advantages and disadvantages, ...
An investigation of the use of recycled crumb rubber as a partial replacement...
DATA STRCUTURE LABORATORY -BCSL305(PRG1)
MLpara ingenieira CIVIL, meca Y AMBIENTAL
BBOC407 BIOLOGY FOR ENGINEERS (CS) - MODULE 1 PART 1.pptx
MAD Unit - 3 User Interface and Data Management (Diploma IT)
IAE-V2500 Engine for Airbus Family 319/320

HYPPO - NECSTTechTalk 23/04/2020

  • 1. Managing power and performance trade-offs in distributed cloud-na7ve infrastructures NECST (Thursday) Friday talk, 04/23/2020 Rolando Brondolin <[email protected]>
  • 2. 2
  • 3. 3
  • 5. 5
  • 6. 6
  • 7. A cloudy landscape Cloud services became more structured and variegated in the last few years 7 Physical Hardware VM1 VM2 C1 C2 C3 C4 The complexity of the environment is left to the Cloud provider
  • 8. 8 Data-centers will consume 8% of the energy consump7on of the world by 2030 [1] [1] Anders SG Andrae and Tomas Edler. On global electricity usage of communication technology: trends to 2030. Challenges, 6(1):117–157, 2015.
  • 9. The need for power awareness 9 [2] Beloglazov, A., Buyya, R., Lee, Y. C., Zomaya, A., et Al, taxonomy and survey of energy-efficient datacenters and cloud compu7ng systems. Advances in computers 82, 2 (2011) 47–111. [1] Cui, Yan, et al. "Total cost of ownership model for data center technology evaluation." Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm), 2017 16th IEEE Intersociety Conference on. IEEE, 2017. Power consumption is accounted for 20% of a data-center TCO [1] Lifetime energy cost will exceed hardware cost in the near future [2] Energy budgets and power caps constraint the performance of the system Power consumption is affected by a plethora of different actors Performance are key for production systems despite power consumption
  • 10. The need for power awareness 10 (Builds on the previous slide.) There is a trade-off between power consumption and performance. Applications provide a service to the user: • performance should be guaranteed • power saving comes from run-time management
  • 11. Cloud-native technologies 11 According to the CNCF[1]: "Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach." [1] https://github.com/cncf/toc/blob/master/DEFINITION.md
  • 12. 12 Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, et al. An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 3–18, 2019. We said DeathStars…
  • 13. Power vs performance: challenges 13 • how we can measure the behavior of cloud-native applications and the environment in terms of performance and power consumption • how we can define performance throughout cloud-native applications and how we can define meaningful performance targets • how we can effectively and precisely reduce power consumption while preserving the performance of the running workloads
  • 14. Run-time power management 14 Proposed general approach: an Observe - Decide - Act (O-D-A) loop. Observe: instrumentation-free observation • performance monitor • power monitor and attribution. Decide and optimize: • reactive/predictive power-aware control • control policies based on overall goals. Act: fast actuation • enforce policies to save power • tune power and resource allocation. Goal: introduce autonomicity to improve energy efficiency and energy proportionality of cloud-native applications
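The Observe - Decide - Act loop above can be sketched as a minimal control skeleton. This is a hypothetical illustration, not the actual HyPPO code: the function names and the 80%-of-last-draw policy in the usage example are assumptions for the sake of the demo.

```python
import time

def run_oda_loop(observe, decide, act, interval_s=1.0, iterations=None):
    """Generic Observe-Decide-Act loop: sample metrics, pick an action,
    enforce it, then wait for the next control period."""
    i = 0
    while iterations is None or i < iterations:
        metrics = observe()          # e.g. per-container power and CPU usage
        decision = decide(metrics)   # e.g. a per-node power cap in watts
        act(decision)                # e.g. program RAPL / a cgroup CPU quota
        time.sleep(interval_s)
        i += 1

# Toy usage: cap node power at 80% of the last observed draw.
if __name__ == "__main__":
    samples = iter([100.0, 90.0, 85.0])   # pretend power readings (watts)
    caps = []
    run_oda_loop(
        observe=lambda: next(samples),
        decide=lambda watts: 0.8 * watts,
        act=caps.append,
        interval_s=0.0,
        iterations=3,
    )
    print(caps)  # → [80.0, 72.0, 68.0]
```

In the real system the three callbacks map onto DEEP-mon (observe), the HyPPO controller (decide), and the actuator agents (act), as the next slides detail.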
  • 15. Run-time power management 15 Observe: DEEP-mon • per-container power monitor • per-container performance measurements and PMCs • per-request latency measurements. Decide: HyPPO • reactive control • CPU usage & request • power capping. Act: RAPL, CPU quota, thread affinity
  • 17. DEEP-mon at a glance 17 • DEEP-mon is an HT-aware, fine-grained power monitor for container-based environments - precise power attribution to containers - instrumentation free: watches workloads from the outside - lightweight, with little overhead on the target workloads and systems - scalable and distributed, to observe Kubernetes clusters • Monitoring ingredients: container execution → context switch; resource usage → performance counter (cycles)*; power consumption → Intel RAPL. * cycles have a 99% correlation w.r.t. CPU power usage
  • 18. Anatomy of a power monitoring agent 18 user-space kernel-space Intel RAPL DEEP-mon Power attribution Docker and Kubernetes metrics kernel tracing PMC context switch Linux CFS Monitoring back-end
  • 19. Anatomy of a power monitoring agent 19 user-space kernel-space Intel RAPL DEEP-mon Power attribution Docker and Kubernetes metrics PMC Monitoring back-end 200K evts/s kernel tracing context switch Linux CFS
  • 20. Kernel level data acquisition (1) 20 • We cannot send each context switch to user-space - too many events per second to process - too much overhead • Introduce in-kernel data aggregation: eBPF and BCC build, inject and execute code in a kernel VM; trace context switches and count PMCs on the fly; store data in eBPF data structures; send one big event instead of many small ones (DEEP-mon kernel)
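The aggregation idea can be simulated in plain Python. This is only a sketch of the logic: the real DEEP-mon performs it inside eBPF maps populated by BCC-injected kernel code, and the class and method names here are illustrative.

```python
from collections import defaultdict

class SwitchAggregator:
    """Aggregate context-switch events per (pid, cpu) instead of
    forwarding each one, then flush a single summary per interval."""

    def __init__(self):
        self.cycles = defaultdict(int)    # (pid, cpu) -> accumulated cycles
        self.switches = defaultdict(int)  # (pid, cpu) -> context switches seen

    def on_context_switch(self, pid, cpu, cycles_delta):
        # In the real system this path runs in the kernel on every switch,
        # updating an eBPF hash map instead of a Python dict.
        self.cycles[(pid, cpu)] += cycles_delta
        self.switches[(pid, cpu)] += 1

    def flush(self):
        # One big event per interval instead of many small ones.
        snapshot = {k: (self.cycles[k], self.switches[k]) for k in self.cycles}
        self.cycles.clear()
        self.switches.clear()
        return snapshot
```

User-space only pays for one `flush()` per monitoring interval, which is what keeps the per-event overhead off the critical path.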
  • 21. Correlate power and performance 21 • At fixed time intervals we collect the thread map (eBPF output: Thread1, Thread2, Thread3) • Then we extract the power measurement from RAPL and attribute it to each thread. A thread's weighted cycle count sums the cycles of execution periods in which the thread ran alone on its physical core, plus the cycles of periods in which it was co-running on the same physical core via HT, weighted by the HT ratio and divided by 2 to split the overlapping cycles between the two threads; an execution period is the time between context switches on the physical core. (NPB experiments pinning two threads on the same physical core of a Dell PowerEdge with an E5-2680 v2, 10 cores, under Ubuntu Linux showed a co-running power ratio of ≈1.15.) The power attributed to thread T1 is then

P_T1(t) = Σ_{s=0}^{|S|} RAPL_core(t, s) · ( Cycles^W_T1(t, s) / Σ_{k=0}^{|K|} Cycles^W_Tk(t, s) )   (2)

where |K| is the number of threads running in the server in the given period and |S| the number of sockets: the power of thread 1 is the sum, over all sockets, of the RAPL measurement of the socket times the thread's weight inside the socket power consumption. • Finally we group each thread by container (DEEP-mon kernel)
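Equation (2) maps directly onto a few lines of code. The sketch below is illustrative: `rapl_core` and the weighted-cycle counts stand in for the values DEEP-mon reads from RAPL and the PMCs, and the function name is an assumption.

```python
def attribute_power(rapl_core, weighted_cycles):
    """Split each socket's RAPL core power among threads, proportionally
    to the weighted cycles each thread spent on that socket (Eq. 2).

    rapl_core:       {socket: watts measured over the interval}
    weighted_cycles: {thread: {socket: HT-weighted cycle count}}
    Returns {thread: watts attributed to that thread}.
    """
    # Denominator of Eq. (2): total weighted cycles per socket.
    totals = {s: 0 for s in rapl_core}
    for per_socket in weighted_cycles.values():
        for s, cyc in per_socket.items():
            totals[s] += cyc

    # Numerator: each thread's share of each socket's power.
    power = {}
    for t, per_socket in weighted_cycles.items():
        power[t] = sum(
            rapl_core[s] * cyc / totals[s]
            for s, cyc in per_socket.items() if totals[s] > 0
        )
    return power
```

By construction the per-thread attributions sum back to the RAPL measurement, so no power is lost or double-counted; grouping threads by container is then a simple reduction over this dictionary.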
  • 22. Monitoring containers at scale 22 • Once power data is collected, we can send it to a back-end on a regular basis - further aggregation of metrics data - Kubernetes cluster level view • The backend exposes data for visualization and autonomic power management
  • 23. Experimental results 23 Evaluation goals: monitoring should introduce minimal overhead; we evaluated DEEP-mon w.r.t. its overhead on applications and the target system. Cloud benchmarks: Phoronix Test Suite (pts/apache, pts/nginx, pts/fio) — network and syscall intensive. HPC benchmarks: NAS Parallel Benchmarks (EP, MG, CG) — CPU and memory intensive tasks. Results: cloud benchmarks app overhead < 3.3%, HPC benchmarks app overhead < 4%; cloud benchmarks power overhead 1.74% avg, HPC benchmarks power overhead 0.90% avg
  • 25. Run-time power management 25 Observe: DEEP-mon • per-container power monitor • per-container performance measurements and PMCs • per-request latency measurements. Decide: HyPPO • reactive control • CPU usage & request • power capping. Act: RAPL, CPU quota, thread affinity
  • 26. HyPPO 26 • HyPPO is a Hybrid Performance-aware Power-capping Orchestrator for Kubernetes environments - leverages run-time monitoring data coming from DEEP-mon - guarantees SLAs for each container - autonomic management of SLAs and power consumption - hybrid: HW power capping, SW resource management. Energy proportionality: the resources I use == the energy bill I pay. Performance first: guaranteed user experience while saving power
  • 27. HyPPO at a glance 27 • What if we try to slow down some components? - reducing performance means reducing power usage (most of the time) - do it only without affecting end-user experience (and SLAs) - give performance back when needed • Operative steps - measure CPU usage and power consumption in real-time - reason on new possible allocations - act based on previous decisions • Technical challenges - how to operate on a distributed system - workload instrumentation and power attribution - autonomicity in the decision process is key: keep an eye on goals and requirements - how to push performance and power usage up & down (O-D-A)
  • 28. Distributed ODA loop 28 Master Node Node API API Pod Pod API Pod Pod DEEP-mon agent DEEP-mon agent Monitor Backend ACTUATOR AGENT ACTUATOR AGENT HyPPO
  • 29. HyPPO controller 29 Samples of metrics and Kubernetes status arrive from each monitoring agent at the gRPC collector; monitoring samples are unpacked by the metrics workers, aggregated, and stored in an InfluxDB database, which is queried by the monitoring frontend and by the control-loop components. For each host we compute the power cap to be enforced depending on the running containers:

power_n = P_idle + Σ_{c=0}^{C} ( power_{n,c} + i(c) )   (1)

where power_n is the power of node n, P_idle the power of the idle system, and power_{n,c} the power consumed by the c-th container running on the n-th node. The contribution i(c) adjusts the cap under the container's under/over-utilisation condition and can be positive or negative:

i(c) = (cpu_usage_c − cpu_request_c) × P   if cpu_usage_c exists, 0 otherwise   (2)

where cpu_request_c is the CPU request expressed for the c-th container, cpu_usage_c its actual CPU consumption, and P a proportional factor defined in the controller configuration. P was chosen after several experiments as 10000 mW (10 W) and represents the pace at which the controller tries to fill the opportunity gap. Each container CPU-usage data point passes through the controller of Equations (1) and (2), contributing positively or negatively to the power cap of the node on which the container is actually running. Power is then adjusted depending on how the CPU is used.
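Equations (1) and (2) can be put together as follows. This is a sketch with illustrative names; the idle power value and the container dictionaries in the usage below are made-up inputs, while the 10 W proportional factor follows the controller configuration described above.

```python
def container_adjustment(cpu_usage, cpu_request, p_factor=10.0):
    """Eq. (2): positive when a container uses more CPU than requested,
    negative when it uses less; zero if no usage sample exists.
    CPU values are in cores, p_factor and the result in watts."""
    if cpu_usage is None:
        return 0.0
    return (cpu_usage - cpu_request) * p_factor

def node_power_cap(p_idle, containers, p_factor=10.0):
    """Eq. (1): idle power plus each container's measured power,
    corrected by its under/over-utilisation contribution i(c)."""
    cap = p_idle
    for c in containers:
        cap += c["power"] + container_adjustment(
            c.get("cpu_usage"), c["cpu_request"], p_factor)
    return cap

# Example: a 50 W idle node with one container drawing 20 W that uses
# 4 cores out of a 5-core request -> i(c) = (4-5)*10 = -10 W,
# so the cap tightens to 50 + 20 - 10 = 60 W.
```

A container running under its request shrinks the node's cap (reclaiming the opportunity gap), while one running over it relaxes the cap so performance is not throttled.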
  • 30. Actuation 30 • For each host we have a power budget • We leverage Intel RAPL to enforce the power capping on the system • We translate the power budget into power units • Finally we write the power cap into the power-limit field of the RAPL MSR (Actuator agent: RAPL MSR — power unit, power limit; power budget of node i)
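The budget-to-MSR translation amounts to a unit conversion. Per Intel's documentation, RAPL power units are 1/2^PU watts, where PU is the low nibble of MSR_RAPL_POWER_UNIT; the sketch below covers only this conversion (function names are ours), since actually writing the power-limit MSR requires privileged access and is omitted.

```python
def rapl_power_unit(power_unit_msr):
    """Decode the power-unit field of MSR_RAPL_POWER_UNIT:
    units of 1 / 2^PU watts, PU being bits 3:0 of the register."""
    return 1.0 / (1 << (power_unit_msr & 0xF))

def watts_to_limit_field(budget_watts, power_unit_msr):
    """Translate a power budget in watts into RAPL power units,
    i.e. the integer value for the power-limit field of the MSR."""
    return round(budget_watts / rapl_power_unit(power_unit_msr))
```

With the common PU = 3 (1/8 W units), a 50 W budget becomes a limit-field value of 400.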
  • 31. Experimental setup 31 Testbed: Kubernetes cluster composed of 2 homogeneous nodes. Node specs: Dell PowerEdge r720xd, 2x Intel Xeon E5-2680 Ivy Bridge (20 HT), 2.80GHz, 380GB of RAM. Benchmarks: Phoronix Test Suite, CPU request = 5 cores (5000 millicpus). (Plot: Apache CPU usage vs. CPU request over execution time — the opportunity gap.) Goal: HyPPO should be able to guarantee performance and at the same time try to reduce power usage
  • 32. Experimental results (1): Apache 32 HyPPO provides good performance, with a 5% SLA violation rate. (Plots: Apache CPU usage, controlled vs. uncontrolled, against the CPU request; Apache power consumed.)
  • 33. Experimental results (2): Fio 33 Power saving, but higher benchmark execution time. (Plots: FIO CPU usage; FIO power consumption.)
  • 34. Experimental results (3): Nginx 34 Good power saving results, with few effects on NGINX CPU usage (few threads). (Plots: NGINX CPU usage; NGINX power consumption.)
  • 35. Experimental results (3): Nginx 35 Preliminary tests showed a power saving ranging from 5% to 45% and SLA violations of 2.5% on average. HyPPO currently works well with multithreaded workloads. (Plots as in the previous slide.)
  • 36. 36 Sometimes we cannot slow down applications. Heterogeneity helps us to meet SLAs and improve energy efficiency
  • 37. 37 Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, et al. An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 3–18, 2019. Batches of DeathStars
  • 43. BlastFunction in a nutshell 43 FPGA sharing system for microservices and serverless functions. Reconfiguration-aware and accelerator-independent. Transparent integration and scalability. Fully integrated with an existing cloud orchestrator (Kubernetes)
  • 45. Remote OpenCL library 45 Connects to the Registry/Device Manager to access the remote device. Synchronous + asynchronous actions. Supports multiple kernels, accelerators and concurrent queues. Transparent OpenCL wrapper; easy integration (dynamic library w/ OpenCL loader)
  • 46. Device Manager 46 Controls a single underlying FPGA device. Exposes gRPC and shared memory interfaces to perform actions through the device. Reconfiguration-aware. Groups tasks (read/exec/write) based on the source function. (Diagram: apps → Device Manager execution thread → single task: buffer write, kernel run, buffer read → FPGA)
  • 47. Accelerators Registry 47 Master component of the system. Contains all the information about functions and devices. Performs allocation before and after function deployment. Integrated with the orchestrator through hooks and APIs. Allocation based on runtime metrics of the system
  • 48. Experimental evaluation 48 Setup: • Master node: Intel Xeon W3530, 24GB RAM • 2 worker nodes: Intel i7-6700, 32GB RAM • Terasic DE5a-Net FPGA on all nodes • Measured overhead and performance behaviour on single / multi-app. Test applications/kernels: Sobel filter, matrix multiplication, CNNs
  • 49. Overhead results - I/O latencies 49 • Overhead given by memory copy operations • 3 data copies for BlastFunction, 1 data copy for BlastFunction shm • ~47% latency slowdown w.r.t. native execution (for I/O-only operations) • One data copy to guarantee OpenCL transparency
  • 50. Overhead results 50 Sobel: • BlastFunction overhead starts from 2.46 ms and reaches 24 ms w.r.t. native • BlastFunction shm keeps a 2 ms overhead (24.04%) for all executions w.r.t. native, from images of 10x10 pixels to 1920x1080 pixels • Shared memory provides limited and acceptable overhead. Matrix multiplication: • Minimum RTT of 2 ms for both BlastFunction and BlastFunction shm • BlastFunction shm has a maximum overhead of 17 ms (0.27%) w.r.t. native (over 3.571 s of native execution time at the largest input size) • Shared memory provides negligible overhead
  • 51. Multi-application results (sobel accelerator) 51 Throughput 1.40x; % utilization 1.46x; applications 5 vs 3; latency (max) 1.04x. (Plots: BLF vs. native at low/medium/high load.)
  • 52. Multi-application results (matmult accelerator) 52 Throughput 2.15x; % utilization 1.17x; applications 5 vs 3; latency (max) 0.59x. (Plots: BLF vs. native at low/medium/high load.)
  • 53. Multi-application results (CNN accelerator) 53 Throughput 1.26x; % utilization 1.06x; applications 5 vs 3; latency (max) 1.40x. (Plots: BLF vs. native at low/high load.)
  • 54. Conclusion 54 • Cloud-native application monitoring: performance and power observations - black-box power monitoring approach - performance counters, power consumption, CPU % measured w/ negligible overhead • Cloud-native performance-aware power-capping orchestration - distributed power capping for cloud-native applications - 25% average power savings, 5% SLA violation rate, works well w/ multithreaded workloads • Looking forward: accelerated cloud-native workloads - improve performance of cloud-native microservices w/ FPGA-based systems - sharing improves utilization of the FPGAs with acceptable latency degradation
  • 55. Future work 55 Latency. Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, et al. An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 3–18, 2019.
  • 56. Acknowledgements 56 I would like to thank these amazing people for all their work and support towards the realization of this work (and many others): Fabiola Casasopra, Luca Danelux, Luca Malagux, Giorgia Fiscalex, Sara Notargiacomo, Marco Santambrogio, Tommaso Sardelli, Marco Arnaboldi, Marco Bacis, Andrea Strada, Daniele Rossex, Samuele Barbieri