BUILDING SOURCE CODE LEVEL
PROFILER FOR C++ APPLICATION
Quentin Tsai
Sciwork Conference 2023
Hello!
I am Quentin Tsai
quentin.tsai.tw@gmail.com
• Graduated from NYCU
• Software QA Automation Engineer @ Nvidia (RDSS)
• Software automation testing, performance testing
When my code is running slowly
Check resource usage
• I/O
• Memory
• CPU usage
Identify the bottleneck
• Nested loops
• Excessive function calls
• Inefficient algorithm
• Improper data structure
Optimize the code
• Parallelization
• Memory optimization
• Algorithm time complexity
But how to find the bottleneck?
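Which part of my code runs slowly?
#include <ctime>
#include <iostream>

// Placeholder workload; replace it with the code to be measured.
void do_something() {
    volatile long sum = 0;
    for (long i = 0; i < 1000000; ++i) { sum = sum + i; }
}

int main() {
    // Record the start time (clock() reports processor time, not wall-clock time)
    std::clock_t start = std::clock();
    do_something();
    // Record the stop time
    std::clock_t stop = std::clock();
    // Calculate the elapsed time in seconds
    double elapsed_time = static_cast<double>(stop - start) / CLOCKS_PER_SEC;
    // Output the time taken
    std::cout << "Time taken by do_something: " << elapsed_time << " seconds" << std::endl;
    return 0;
}
Measure each function separately?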
Profilers
Tools to help programmers measure and reason about performance
What is a profiler?
A tool used to analyze a program's runtime behavior and performance characteristics.
Sampling profiling
• Attach to the program, periodically interrupt it, and record the on-CPU function
[Figure: samples taken along a time axis. Function c is sampled 6 times, function d 3 times.]
Focus on optimizing function c?
• For each sample, record the stack trace
[Figure: two sampled stack traces: main -> a -> b -> c and main -> a -> b -> c -> d]
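The counting step of a sampling profiler can be sketched as follows. This is only an illustration under stated assumptions: the samples are taken as already captured (a real profiler would collect them from timer interrupts), and the names `StackSample`, `count_on_cpu`, and `count_call_paths` are hypothetical, not from the talk.

```cpp
#include <map>
#include <string>
#include <vector>

// One sample is the call stack captured at an interrupt, outermost frame first.
// (Hypothetical representation chosen for this sketch.)
using StackSample = std::vector<std::string>;

// Count how often each function was the on-CPU (innermost) frame.
std::map<std::string, int> count_on_cpu(const std::vector<StackSample>& samples) {
    std::map<std::string, int> hits;
    for (const StackSample& stack : samples) {
        if (!stack.empty()) {
            ++hits[stack.back()];
        }
    }
    return hits;
}

// Count how often each full call path was observed, joined as "main -> a -> b".
std::map<std::string, int> count_call_paths(const std::vector<StackSample>& samples) {
    std::map<std::string, int> hits;
    for (const StackSample& stack : samples) {
        std::string path;
        for (const std::string& frame : stack) {
            if (!path.empty()) {
                path += " -> ";
            }
            path += frame;
        }
        if (!path.empty()) {
            ++hits[path];
        }
    }
    return hits;
}
```

Feeding it six samples of main -> a -> b -> c and three of main -> a -> b -> c -> d reproduces the counts from the figure: c is on-CPU 6 times and d 3 times.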
Instrumentation profiling
• Insert code into the program to record performance metrics
• Manually inserted by programmers
• Automatically inserted by tools
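One way tools insert instrumentation automatically is the `-finstrument-functions` option of GCC and Clang, which makes the compiler call two user-supplied hooks at every function entry and exit. The sketch below assumes one of those compilers and that flag. The depth counter `g_depth` and the log format are illustrative choices, not something shown in the talk.

```cpp
#include <cstdio>

// Assumption: built with GCC or Clang using -finstrument-functions.
// g_depth is a hypothetical helper that tracks the current call depth.
static int g_depth = 0;

extern "C" {

// Called by the compiler-generated code on entry to each instrumented function.
// no_instrument_function keeps the hook itself from being instrumented.
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void* this_fn, void* call_site) {
    (void)call_site;
    std::fprintf(stderr, "%*senter %p\n", g_depth * 2, "", this_fn);
    ++g_depth;
}

// Called by the compiler-generated code just before each instrumented function returns.
__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void* this_fn, void* call_site) {
    (void)call_site;
    --g_depth;
    std::fprintf(stderr, "%*sexit  %p\n", g_depth * 2, "", this_fn);
}

}  // extern "C"
```

The hooks receive raw function addresses, so mapping them back to names needs a separate symbol lookup step. The hand-written timers in the milestones that follow avoid that step by passing the function name explicitly.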
Sampling vs. Instrumentation

                  Pros                          Cons
Sampling          • Non-intrusive               • Inline functions are invisible
                  • Low overhead                • Only approximations, not exact
Instrumentation   • Inline functions visible    • Significant overhead
                  • More accurate               • Requires source code / binary rewriting
                  • More customizable
# Overhead Samples Command Shared Object Symbol
# ........ ............ ....... ................. ...................................
#
20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version
|
--- xen_hypercall_xen_version
check_events
|
|--44.13%-- syscall_trace_enter
| tracesys
| |
| |--35.58%-- __GI___libc_fcntl
| | |
| | |--65.26%-- do_redirection_internal
| | | do_redirections
| | | execute_builtin_or_function
| | | execute_simple_command
| | | execute_command_internal
| | | execute_command
| | | execute_while_or_until
| | | execute_while_command
| | | execute_command_internal
| | | execute_command
| | | reader_loop
| | | main
| | | __libc_start_main
| | |
| | --34.74%-- do_redirections
| | |
| | |--54.55%-- execute_builtin_or_function
| | | execute_simple_command
| | | execute_command_internal
| | | execute_command
| | | execute_while_or_until
| | | execute_while_command
| | | execute_command_internal
| | | execute_command
| | | reader_loop
| | | main
| | | __libc_start_main
| | |
Linux perf
Linux's built-in sampling-based profiler
Build a simple source code level profiler
Milestone 1: Log execution time
#include <chrono>
#include <iostream>

// Record the entry time in a local variable named start_time.
#define START_TIMER auto start_time = std::chrono::high_resolution_clock::now()

// Compute and print the time elapsed since START_TIMER in the same scope.
#define STOP_TIMER(functionName) \
    do { \
        auto end_time = std::chrono::high_resolution_clock::now(); \
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time); \
        std::cout << functionName << " took " << duration.count() << " microseconds.\n"; \
    } while (false)

• Define macros
• START_TIMER: get the current time
• STOP_TIMER: calculate the elapsed time
• Insert the macros at function entry and exit
Milestone 1: Log execution time
void function1() {
    START_TIMER;
    for (int i = 0; i < 1000000; ++i) {}
    STOP_TIMER("function1");
}

void function2() {
    START_TIMER;
    for (int i = 0; i < 500000; ++i) {}
    STOP_TIMER("function2");
}

int main() {
    function1();
    function2();
    return 0;
}

❯ ./a.out
function1 took 607 microseconds.
function2 took 291 microseconds.
Milestone 2: Insert fewer macros
class ExecutionTimer {
public:
    // Constructor: remember the function name and record the start time.
    ExecutionTimer(const char* functionName)
        : m_name(functionName)
        , m_start(std::chrono::high_resolution_clock::now()) {
    }

    // Destructor: compute and print the elapsed time when the timer goes out of scope.
    ~ExecutionTimer() {
        auto end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - m_start);
        std::cout << m_name << " took " << duration.count() << " microseconds.\n";
    }

private:
    const char* m_name;
    std::chrono::high_resolution_clock::time_point m_start;
};

• Make use of the constructor and destructor
• Constructor: get the current time
• Destructor: calculate the duration
Milestone 2: Insert fewer macros
void function1() {
    ExecutionTimer timer("function1");
    for (int i = 0; i < 1000000; ++i) {}
}

void function2() {
    ExecutionTimer timer("function2");
    for (int i = 0; i < 500000; ++i) {}
}

int main() {
    function1();
    function2();
    return 0;
}

❯ ./a.out
function1 took 607 microseconds.
function2 took 291 microseconds.
Milestone 3: Hit count of each function
class TimedEntry
{
public:
    size_t count() const { return m_count; }
    double time() const { return m_time; }

    // Record one more call and add its duration to the running total.
    TimedEntry & add_time(double time)
    {
        ++m_count;
        m_time += time;
        return *this;
    }

private:
    size_t m_count = 0;
    double m_time = 0.0;
};

Create another class to hold each function's
• execution time
• hit count
Use a dictionary to hold the records:
std::map<std::string, TimedEntry> m_map;
Milestone 3: Hit count of each function
void function1() {
    ExecutionTimer timer = Profiler::getInstance().startTimer("function1");
    for (int i = 0; i < 1000000; ++i) {}
}

void function2() {
    ExecutionTimer timer = Profiler::getInstance().startTimer("function2");
    for (int i = 0; i < 500000; ++i) {}
}

int main() {
    function1();
    function2();
    function2();
    return 0;
}

❯ ./a.out
Profiler started.
Function1, hit = 1, time = 320 microseconds.
Function2, hit = 2, time = 314 microseconds.
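The slides do not show the `Profiler` class behind `Profiler::getInstance().startTimer(...)`. A minimal sketch of how it might look is given below, assuming a singleton that owns the `std::map<std::string, TimedEntry>` and a timer that reports to it on destruction. The `record`, `entries`, and `report` members and the move constructor are assumptions made for this sketch, not the author's implementation.

```cpp
#include <chrono>
#include <cstddef>
#include <map>
#include <ostream>
#include <string>
#include <utility>

// Accumulates hit count and total time for one function (as on the slide).
class TimedEntry {
public:
    std::size_t count() const { return m_count; }
    double time() const { return m_time; }
    TimedEntry& add_time(double time) { ++m_count; m_time += time; return *this; }
private:
    std::size_t m_count = 0;
    double m_time = 0.0;
};

class Profiler;

// RAII timer: reports its elapsed time to the profiler when it is destroyed.
class ExecutionTimer {
    using clock_type = std::chrono::steady_clock;
public:
    ExecutionTimer(Profiler& profiler, std::string name)
        : m_profiler(&profiler), m_name(std::move(name)), m_start(clock_type::now()) {}
    // Moving disarms the source so that each call is recorded exactly once.
    ExecutionTimer(ExecutionTimer&& other) noexcept
        : m_profiler(other.m_profiler), m_name(std::move(other.m_name)), m_start(other.m_start) {
        other.m_profiler = nullptr;
    }
    ExecutionTimer(const ExecutionTimer&) = delete;
    ExecutionTimer& operator=(const ExecutionTimer&) = delete;
    ~ExecutionTimer();
private:
    Profiler* m_profiler;
    std::string m_name;
    clock_type::time_point m_start;
};

// Hypothetical singleton that owns the per-function records.
class Profiler {
public:
    static Profiler& getInstance() {
        static Profiler instance;
        return instance;
    }
    ExecutionTimer startTimer(const std::string& name) { return ExecutionTimer(*this, name); }
    void record(const std::string& name, double microseconds) { m_map[name].add_time(microseconds); }
    const std::map<std::string, TimedEntry>& entries() const { return m_map; }
    // Print one line per function: name, hit count, accumulated time.
    void report(std::ostream& out) const {
        for (const auto& item : m_map) {
            out << item.first << ", hit = " << item.second.count()
                << ", time = " << item.second.time() << " microseconds.\n";
        }
    }
private:
    Profiler() = default;
    std::map<std::string, TimedEntry> m_map;
};

inline ExecutionTimer::~ExecutionTimer() {
    if (m_profiler == nullptr) {
        return;  // moved-from timer: nothing to record
    }
    std::chrono::duration<double, std::micro> elapsed = clock_type::now() - m_start;
    m_profiler->record(m_name, elapsed.count());
}
```

With this shape, calling `Profiler::getInstance().report(std::cout)` at the end of `main` would print the accumulated records. `std::chrono::steady_clock` is used here because it is monotonic, which is a design choice of the sketch rather than something the slides specify.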
Milestone 4: Call Path Profiling
• A function may have different callers
• Knowing which call paths are executed frequently is important
• But how do we maintain the call tree during profiling?
a -> b -> c -> d -> e
a -> e
Milestone 4: Call Path Profiling - Radix Tree
• Each node represents a function
• A child node represents a callee
• The profiling data can be stored within the node
https://static.lwn.net/images/ns/kernel/radix-tree-2.png
Milestone 4: Call Path Profiling - Radix Tree
Function calls
1 main
2 main -> a
3 main -> a -> b
4 main -> a -> b -> c
5 main -> a -> b
6 main -> a
7 main -> a -> c
Resulting tree
main
  a
    b
      c
    c
• Dynamically grow the tree while profiling
Milestone 4: Call Path Profiling - RadixTreeNode
#include <cstdint>
#include <list>
#include <memory>
#include <string>

template <typename T>
class RadixTreeNode
{
public:
    using child_list_type = std::list<std::unique_ptr<RadixTreeNode<T>>>;
    using key_type = int32_t;

    // Default constructor, needed to create the root node of the tree.
    RadixTreeNode() = default;

    RadixTreeNode(std::string const & name, key_type key)
        : m_key(key)
        , m_name(name)
        , m_prev(nullptr)
    {
    }

private:
    key_type m_key = -1;                  // integer ID of the function
    std::string m_name;                   // function name
    T m_data;                             // profiling data
    child_list_type m_children;           // callees
    RadixTreeNode<T> * m_prev = nullptr;  // parent (caller)
};

• A node has
  • a function name
  • profiling data
    • execution time
    • hit count
  • a list of children (callees)
  • a pointer back to its parent (caller)
Milestone 4: Call Path Profiling - RadixTree
template <typename T>
class RadixTree
{
public:
    using key_type = typename RadixTreeNode<T>::key_type;

    RadixTree()
        : m_root(std::make_unique<RadixTreeNode<T>>())
        , m_current_node(m_root.get())
    {
    }

private:
    // Map a function name to a unique integer ID, creating one on first use.
    key_type get_id(const std::string & name)
    {
        auto [it, inserted] = m_id_map.try_emplace(name, m_unique_id++);
        return it->second;
    }

    std::unique_ptr<RadixTreeNode<T>> m_root;
    RadixTreeNode<T> * m_current_node;
    std::unordered_map<std::string, key_type> m_id_map;
    key_type m_unique_id = 0;
};

A tree has
• a root pointer
• a current pointer (the function currently on the CPU)
Milestone 4: Call Path Profiling - RadixTree
T & entry(const std::string & name)
{
    key_type id = get_id(name);
    RadixTreeNode<T> * child = m_current_node->get_child(id);
    if (!child)
    {
        m_current_node = m_current_node->add_child(name, id);
    }
    else
    {
        m_current_node = child;
    }
    return m_current_node->data();
}

When entering a function
• Map the function name to an ID
  • For faster integer comparison
• Check whether the current node has such a child
• Create the child if it does not exist
• Increment the hit count
• Change the current pointer
Milestone 4: Call Path Profiling - RadixTree
void add_time(double time)
{
    // Add the elapsed time to the node of the function being left,
    // then step back to its caller.
    m_tree.get_current_node()->data().add_time(time);
    m_tree.move_current_to_parent();
}

When leaving a function
• Update the execution time
• Change the current pointer to the caller

Function calls
1 main
2 main -> a
3 main -> a -> b
4 main -> a -> b -> c
5 main -> a -> b
6 main -> a -> c

Result
main()
  a() : hit = 1, time = 680 microseconds
    b() : hit = 1, time = 470 microseconds
      c() : hit = 1, time = 120 microseconds
    c() : hit = 1, time = 124 microseconds
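The enter and leave steps can be put together in a small, non-templated sketch. The slides omit `get_child`, `add_child`, and `move_current_to_parent`, so the versions here are assumptions. `CallStats`, `CallTreeNode`, and `CallTree` are hypothetical names, and the hit count is incremented on entry, following the bullet list rather than `TimedEntry::add_time`.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <memory>
#include <string>
#include <unordered_map>

// Per-node profiling data (hypothetical stand-in for the slide's TimedEntry).
struct CallStats {
    std::size_t hits = 0;
    double time = 0.0;
};

class CallTreeNode {
public:
    using key_type = std::int32_t;
    CallTreeNode(std::string name, key_type key, CallTreeNode* parent)
        : m_key(key), m_name(std::move(name)), m_parent(parent) {}
    // Linear search over the callees; returns nullptr when there is no such child.
    CallTreeNode* get_child(key_type key) const {
        for (const auto& child : m_children) {
            if (child->m_key == key) {
                return child.get();
            }
        }
        return nullptr;
    }
    CallTreeNode* add_child(const std::string& name, key_type key) {
        m_children.push_back(std::make_unique<CallTreeNode>(name, key, this));
        return m_children.back().get();
    }
    CallStats& data() { return m_data; }
    const CallStats& data() const { return m_data; }
    const std::string& name() const { return m_name; }
    CallTreeNode* parent() const { return m_parent; }
private:
    key_type m_key;
    std::string m_name;
    CallStats m_data;
    std::list<std::unique_ptr<CallTreeNode>> m_children;
    CallTreeNode* m_parent;
};

class CallTree {
public:
    using key_type = CallTreeNode::key_type;
    CallTree()
        : m_root(std::make_unique<CallTreeNode>("root", -1, nullptr))
        , m_current(m_root.get()) {}
    // Entering a function: descend to (or create) the child and count the hit.
    void enter(const std::string& name) {
        key_type id = get_id(name);
        CallTreeNode* child = m_current->get_child(id);
        if (child == nullptr) {
            child = m_current->add_child(name, id);
        }
        m_current = child;
        ++m_current->data().hits;
    }
    // Leaving a function: add the elapsed time and move back to the caller.
    void leave(double elapsed) {
        m_current->data().time += elapsed;
        if (m_current->parent() != nullptr) {
            m_current = m_current->parent();
        }
    }
    const CallTreeNode* root() const { return m_root.get(); }
    const CallTreeNode* current() const { return m_current; }
private:
    // IDs are handed out in order of first appearance, starting from 0.
    key_type get_id(const std::string& name) {
        auto it = m_id_map.find(name);
        if (it != m_id_map.end()) {
            return it->second;
        }
        key_type id = m_next_id++;
        m_id_map.emplace(name, id);
        return id;
    }
    std::unique_ptr<CallTreeNode> m_root;
    CallTreeNode* m_current;
    std::unordered_map<std::string, key_type> m_id_map;
    key_type m_next_id = 0;
};
```

Replaying the call sequence from the slide (enter main, a, b, c, then leave c and b, enter c again, leave c and a) yields two distinct c nodes, one under b and one under a, each with its own hit count and time.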
SUMMARY
1. Sampling-based profilers can quickly deliver performance metrics
2. Intrusive (instrumentation-based) profilers can capture the program's detailed behavior
3. Developing our own source-code-level profiler lets us customize the performance metrics in the future
4. It's more fun to craft a profiler than to use an existing tool
THANK YOU