20 April 2019, Tbilisi, Georgia
Ivan Shamrai, Senior NFT Analyst, Exactpro
Failover and Recovery Test Automation
Financial infrastructures
● Exchanges
● Broker systems
● Clearing agencies
● Ticker plants
● Surveillance systems
Risks associated with financial infrastructure outage:
● Lost profit
● Data loss
● Damaged reputation
Distributed high-performance computing
● Bare-metal servers (no virtualization, no Docker or other handy tools)
● Horizontal scalability
● Redundancy (absence of single point of failure)
Resilience tests
● Hardware outages
○ Network equipment failovers (Switches, Ports, Network adapters)
○ Server isolations
● Software outages
○ Simulation of various outage types (SIGKILL, SIGSTOP)
○ Failovers in different system states (at startup / during the trading day / during an auction)
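A minimal sketch of how the two outage types named above could be simulated against a single process. It is not the tooling used in the talk: the sleeping child process is a hypothetical stand-in for a system component, and the code assumes a POSIX host. SIGSTOP freezes the process (a "hung" component), while SIGKILL terminates it with no chance to clean up.

```python
import signal
import subprocess
import sys


def simulate_outage(hang_first: bool = True) -> int:
    """Start a stand-in process, optionally freeze it, then kill it.

    Returns the child's exit status as reported by Popen.wait().
    """
    # Hypothetical stand-in for a system component: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        if hang_first:
            # SIGSTOP suspends the process without terminating it.
            child.send_signal(signal.SIGSTOP)
        # SIGKILL cannot be caught or ignored, even by a stopped process.
        child.send_signal(signal.SIGKILL)
        return child.wait(timeout=10)
    finally:
        # Safety net so the sketch never leaves a stray child behind.
        if child.poll() is None:
            child.kill()
            child.wait()


if __name__ == "__main__":
    # On POSIX a negative status means "terminated by that signal number".
    print(simulate_outage())
```

In a real resilience test the signal would be sent to the component under test on its own server, and the interesting part is what the rest of the system does afterwards.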
What cases to test?
● Failover – failure of the active primary instance (standby becomes active)
● Failback – failure of the active standby instance
● Standby failure – failure of the passive standby instance
● Double failure – simultaneous failure of both instances
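One way to turn these cases into a test plan is to cross them with the outage types and system states listed on the "Resilience tests" slide. The sketch below does that with a Cartesian product. Whether the talk's scenarios really cover every combination is an assumption; the names `FAILURE_CASES`, `OUTAGE_SIGNALS`, `SYSTEM_STATES` and `build_test_matrix` are illustrative.

```python
from itertools import product

# Values taken from the slides; the idea of crossing them is an assumption.
FAILURE_CASES = ["failover", "failback", "standby failure", "double failure"]
OUTAGE_SIGNALS = ["SIGKILL", "SIGSTOP"]
SYSTEM_STATES = ["startup", "trading day", "auction"]


def build_test_matrix():
    """Return one scenario description per (case, signal, state) combination."""
    return [
        {"case": case, "signal": sig, "state": state}
        for case, sig, state in product(
            FAILURE_CASES, OUTAGE_SIGNALS, SYSTEM_STATES
        )
    ]


matrix = build_test_matrix()
print(len(matrix))  # 4 cases * 2 signals * 3 states = 24
print(matrix[0])
```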
What kinds of data do we need and when?
● Pre-SOD: system snapshots and backups
● Real-time:
○ System metrics of all servers and all components (processes)
○ Captured traffic of injected load and system responses
○ Log files of the system
● Post-EOD: log data for passive testing and results analysis
Defect mining in collected data
● Log files
○ Log entries per second
○ Warnings per second
○ Errors per second
● Captured traffic
○ Transaction statistics
○ Response time (latency)
○ Throughput
● System statistics
○ Disk usage
○ RAM usage
○ CPU usage
○ Network stats
Rules and thresholds
ALERT:
METRIC : RSS
GROWTH : 1GB
TIME : 10 MIN
ALERT:
METRIC : DISK
GROWTH : 10%
TIME : 1 HOUR
[Chart: server MP101, process MatchingEngine Primary, metric RSS (resident set size)]
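A rule of this shape (metric, allowed growth, time window) could be evaluated roughly as follows. This is a sketch, not the rule engine from the talk: `GrowthRule` and `violates` are hypothetical names, samples are assumed to be sorted by time, RSS is expressed in MB, and the DISK rule's "10%" is read as percentage points of disk usage.

```python
from dataclasses import dataclass
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)


@dataclass
class GrowthRule:
    metric: str
    growth: float    # allowed growth, in the metric's own unit
    window_s: float  # length of the time window in seconds


def violates(rule: GrowthRule, samples: List[Sample]) -> bool:
    """True if the metric grew by at least rule.growth within any window.

    Assumes `samples` is sorted by timestamp.
    """
    for i, (t0, v0) in enumerate(samples):
        for t1, v1 in samples[i + 1:]:
            if t1 - t0 > rule.window_s:
                break  # later samples are even further away in time
            if v1 - v0 >= rule.growth:
                return True
    return False


# ALERT: METRIC RSS, GROWTH 1GB, TIME 10 MIN (values in MB)
rss_rule = GrowthRule(metric="RSS", growth=1024, window_s=10 * 60)
# ALERT: METRIC DISK, GROWTH 10%, TIME 1 HOUR (values in percent used)
disk_rule = GrowthRule(metric="DISK", growth=10, window_s=60 * 60)

leaking = [(0, 2048), (300, 2400), (540, 3100)]  # +1052 MB in 9 minutes
steady = [(0, 2048), (600, 2560), (1200, 3072)]  # +512 MB per 10 minutes
print(violates(rss_rule, leaking), violates(rss_rule, steady))
```

The pairwise scan is quadratic in the number of samples; for long runs a sliding-window minimum would be the natural replacement.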
Spikes and stairs detection
[Chart: server OE102, process FixGateway Standby, metric RSS (resident set size)]
Spikes and stairs detection
Example:
● A CPU usage spike happened on the TransactionRouter component at ~11:49
● Most likely, the last scenario step executed before 11:49 caused that spike
● Information about this abnormality and the steps that produced it will be included in the final report
[Chart: server CA104, process TransactionRouter Primary, metric CPU usage]
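The example above combines two steps: spotting the spike and tying it to the scenario step that preceded it. The sketch below shows one simple way to do both. The detection method (value compared with the median of the preceding samples) is a stand-in chosen for illustration, not the algorithm from the talk, and it covers spikes only, not stairs. The function names, the sample values and the scenario step names are all hypothetical.

```python
from statistics import median
from typing import List, Optional, Tuple


def find_spikes(values: List[float], window: int = 5,
                factor: float = 3.0) -> List[int]:
    """Indices whose value exceeds `factor` times the median of the
    `window` samples immediately before them."""
    spikes = []
    for i in range(window, len(values)):
        baseline = median(values[i - window:i])
        if baseline > 0 and values[i] > factor * baseline:
            spikes.append(i)
    return spikes


def last_step_before(steps: List[Tuple[str, str]],
                     when: str) -> Optional[str]:
    """Name of the latest step whose zero-padded HH:MM time is <= `when`."""
    earlier = [(t, name) for t, name in steps if t <= when]
    return max(earlier)[1] if earlier else None


# Hypothetical CPU usage samples, one per minute.
times = ["11:44", "11:45", "11:46", "11:47", "11:48", "11:49", "11:50"]
cpu = [12.0, 14.0, 13.0, 12.0, 15.0, 85.0, 14.0]

# Hypothetical scenario log: (time, step name).
steps = [("11:30", "start load"),
         ("11:47", "isolate server"),
         ("11:52", "restore server")]

for idx in find_spikes(cpu):
    print(times[idx], "spike; suspected step:",
          last_step_before(steps, times[idx]))
```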
Data reconciliation checks
● Consistency across different data streams
○ Client’s messages
○ Public market data
○ Aggregated market data
● Consistency between data streams and system’s database
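As a rough illustration of a consistency check across streams, the sketch below compares the sets of identifiers seen in each source and reports what each one is missing. It is deliberately simplified: a real reconciliation would also compare prices, quantities and ordering. The stream names, the trade IDs and the `reconcile` function are hypothetical.

```python
from typing import Dict, Iterable, Set


def reconcile(streams: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """For each stream, return the IDs seen elsewhere but missing from it.

    Streams that contain every ID are omitted from the result.
    """
    id_sets = {name: set(ids) for name, ids in streams.items()}
    all_ids = set().union(*id_sets.values())
    return {
        name: all_ids - ids
        for name, ids in id_sets.items()
        if all_ids - ids
    }


# Hypothetical trade IDs extracted from each source.
streams = {
    "client_messages": ["T1", "T2", "T3"],
    "public_market_data": ["T1", "T2", "T3"],
    "database": ["T1", "T3"],
}
print(reconcile(streams))  # the database is missing T2
```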
How to collect data in real-time?
● Use of available system tools
● Use of monitoring provided by a proprietary software vendor
● Use of third party monitoring tools
How about reinventing the wheel?
● Independent
● Incorporate all the features we need in one tool
● Remote controlled
● Support of different output formats: protobuf, JSON, raw binary data
● Support of multiple data consumers with different visibility
● Deliver data on a need-to-know basis only
● Uniform data format across all environments
● Low footprint
Downsides of the brand new bicycle
● Green code: not well tested in the field
● Requires additional resources for support
● Solves only a particular problem
Who should receive real-time data?
● Different tests require dozens of different metrics
● A tester is not able to track all the changes
● All the data should be analyzed on the fly
● Test behaviour should be changed depending on the received data
High level view on real-time monitoring
Components shown in the diagram:
● Daemon_S (Daemon_S1, Daemon_S2, ..., Daemon_SN): collecting system info, parsing logs, executing commands
● Daemon_M: collecting system info, parsing logs, executing commands
● Daemon_I: load control and test script execution
● Router: communication between daemons and controllers
● TM (TestManager): automated execution of test scenarios, collecting and processing test information
● Data Processor: transform, collect and store data for future use
● Data visualisation and reporting
● Data storage
Hosts shown in the diagram: Management Server, QA Server, Server 1, Server 2, ..., Server N
Passive monitoring
Components shown in the diagram:
● Management Server: TestManager, Data Processor, Router
● Matching Server: Daemon MEP, MatchingEnginePrimary, Matching log
● Monitoring Server: Daemon MON, System events log, System metrics log
Example messages shown in the diagram:
MatchingEnginePrimary {PID: 1234, RSS: 500MB, CPU Usage: 15%}
MatchingEnginePrimary {STATE: READY}
MatchingEnginePrimary {INTERNAL LATENCY: 10}
System {CPU Usage: 15%, Free Mem: 50%, Free Disk Space: 80%}
● The MON Daemon collects system metrics and messages
● The MEP Daemon parses the matching log and provides the Router with current system information
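The messages on this slide follow a `Source {Key: Value, ...}` shape. A consumer could parse that shape roughly as sketched below. This treats the slide text as if it were the wire format, which is an assumption (the talk lists protobuf, JSON and raw binary as output formats); `MESSAGE_RE` and `parse_message` are illustrative names.

```python
import re
from typing import Dict, Tuple

# Source name, then a brace-delimited body of comma-separated "key: value".
MESSAGE_RE = re.compile(r"^\s*(?P<source>\w+)\s*\{(?P<body>[^}]*)\}\s*$")


def parse_message(line: str) -> Tuple[str, Dict[str, str]]:
    """Split 'Source {K: V, ...}' into the source name and a field dict."""
    match = MESSAGE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised message: {line!r}")
    fields = {}
    for pair in match.group("body").split(","):
        if not pair.strip():
            continue  # tolerate an empty body or a trailing comma
        key, _, value = pair.partition(":")
        fields[key.strip()] = value.strip()
    return match.group("source"), fields


source, fields = parse_message(
    "System {CPU Usage: 15%, Free Mem: 50%, Free Disk Space: 80%}"
)
print(source, fields)
```

Values are kept as strings on purpose; converting "500MB" or "15%" to numbers depends on unit conventions the slides do not spell out.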
Active monitoring
Components shown in the diagram:
● Management Server: TestManager, Data Processor, Router
● Matching Server: Daemon MEP, MatchingEnginePrimary, Matching log
● Monitoring Server: Daemon MON, System events log, System metrics log
Command shown in the diagram: Stop matching log monitor
Example message shown in the diagram:
System {CPU Usage: 1%, Free Mem: 75%, Free Disk Space: 83%}
● When real-time data is not required, a user or an automated scenario can stop or update a task for an active monitor to reduce system load
Post-EOD data
● Checkpoints from the TestManager tool
● System and hardware usage stats
● Essential internal metrics from the system under test
What’s wrong with system logs?
Bias: logs should be human friendly
...
~|=============================================================================
~|Disk I/O statistics
~|=============================================================================
~|Device Reads/sec Writes/sec AvgQSize AvgWait AvgSrvceTime
~|sda 0.0 ( 0.0kB) 4.1 ( 22.4kB) 0.0 0.0ms 0.0ms
~|sdb 0.0 ( 0.0kB) 0.0 ( 0.0kB) 0.0 0.0ms 0.0ms
~|sdc 0.0 ( 0.0kB) 10.7 ( 70.5kB) 0.0 0.0ms 0.0ms
20181030074410.191|504|TEXT |System Memory Information (from /proc/meminfo)
~|=============================================================================
~|MemTotal: 263868528 kB
~|MemFree: 252390192 kB
...
What’s wrong with system logs?
Not standardized
Release 1:
Oct 30 2017 13:30:28 | SystemComponent:1 | Transfer Queue| Rate=0.00 (W=0.00,L=0.00, Q=0.00, T=0.00), Max Queues=(Pub=0, Pvt=0),
Release 2:
Dec 12 2017 08:10:13 | SystemComponent:1 | Transfer Queue from Rcv Thread to Main Thread | Rate=0.00 | W=0.00 | L=0.00 | Q=0.00 | T=0.00
Dec 12 2017 08:10:13 | SystemComponent:1 | Max Queues from Rcv Thread to Main Thread | Pub=0, Pvt=0
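One way to survive this kind of format drift is to avoid matching the whole line and pull out only the `key=number` pairs. The sketch below does that and recovers the same metrics from the Release 1 line and from the two Release 2 lines. It is an illustration, not the parser used in the talk; `KEY_VALUE_RE` and `extract_metrics` are hypothetical names.

```python
import re
from typing import Dict

# An identifier, "=", then a number. "Max Queues=(" is skipped on purpose
# because "(" is not a number; the nested Pub/Pvt pairs are still captured.
KEY_VALUE_RE = re.compile(r"\b([A-Za-z]\w*)=(-?\d+(?:\.\d+)?)")


def extract_metrics(line: str) -> Dict[str, float]:
    """Return every key=number pair found in a log line."""
    return {key: float(value) for key, value in KEY_VALUE_RE.findall(line)}


RELEASE_1 = (
    "Oct 30 2017 13:30:28 | SystemComponent:1 | Transfer Queue| Rate=0.00 "
    "(W=0.00,L=0.00, Q=0.00, T=0.00), Max Queues=(Pub=0, Pvt=0),"
)
RELEASE_2 = [
    "Dec 12 2017 08:10:13 | SystemComponent:1 | Transfer Queue from Rcv "
    "Thread to Main Thread | Rate=0.00 | W=0.00 | L=0.00 | Q=0.00 | T=0.00",
    "Dec 12 2017 08:10:13 | SystemComponent:1 | Max Queues from Rcv Thread "
    "to Main Thread | Pub=0, Pvt=0",
]

old = extract_metrics(RELEASE_1)
new: Dict[str, float] = {}
for log_line in RELEASE_2:
    new.update(extract_metrics(log_line))

print(sorted(old) == sorted(new))  # same metric names in both releases
```

The trade-off is that the metric's context (which queue, which thread) is lost, so this works best as a fallback next to stricter per-format patterns.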
How to deal with creative loggers?
● Accept the reality
● No one will change the log format just for you
● No one will ask you before changing the log format
● Regexp-like patterns are our “best friends”
● Automatic log format analysis
UNKNOWN METRIC DETECTED:
[SystemComponent:1]: A To B | Rate=0.00 (W=0.00,L=0.00, Q=0.00, T=0.00), Mode=LOW_LATENCY, Max Queues=(Pub=0, Pvt=0)
KNOWN METRICS:
[SystemComponent:1]: AToB | Rate=0.00 [W=0.00,L=0.00, Q=0.00, T=0.00], Mode=LOW_LATENCY, Max Queues=[Pub=0, Pvt=0]
[SystemComponent:1]: ABToWorker | Rate=0.00 (W=0.00,L=0.00, Q=0.00, T=0.00), Mode=LOW_LATENCY, Max Queues=(Pub=0, Pvt=0)
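The "unknown metric" report on this slide could come from a check like the one sketched below: extract the metric name from each line, compare it with the set of known names, and suggest a known name when the two differ only in spacing or case. How the talk's tool actually performs its analysis is not stated, so this is an assumption, and it looks at names only, not at bracket style. `metric_name`, `is_unknown` and `suggest_known` are hypothetical helpers.

```python
import re
from typing import Optional, Set

# "[Component]: Metric Name | ..." -> captures the metric name.
METRIC_NAME_RE = re.compile(r"^\[[^\]]+\]:\s*(?P<name>[^|]+?)\s*\|")


def metric_name(line: str) -> Optional[str]:
    """Return the metric name of a log line, or None if it has no such part."""
    match = METRIC_NAME_RE.match(line)
    return match.group("name") if match else None


def is_unknown(line: str, known: Set[str]) -> bool:
    """True if the line's metric name is missing or not in `known`."""
    name = metric_name(line)
    return name is None or name not in known


def suggest_known(name: str, known: Set[str]) -> Optional[str]:
    """Known name that equals `name` once whitespace and case are ignored."""
    def canon(text: str) -> str:
        return re.sub(r"\s+", "", text).lower()

    for candidate in sorted(known):
        if canon(candidate) == canon(name):
            return candidate
    return None


KNOWN = {"AToB", "ABToWorker"}
line = ("[SystemComponent:1]: A To B | Rate=0.00 (W=0.00,L=0.00, Q=0.00, "
        "T=0.00), Mode=LOW_LATENCY, Max Queues=(Pub=0, Pvt=0)")

if is_unknown(line, KNOWN):
    name = metric_name(line)
    print("UNKNOWN METRIC DETECTED:", name,
          "| possibly renamed from:", suggest_known(name, KNOWN))
```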
Where to store and how long?
● Data is sensitive and should be stored on the client’s side
● Data volume is huge for limited hardware resources in the test environment
● Data retention
Current data:
○ HW stats
○ System metrics
○ System configs
○ Traffic
Historical data:
○ Anonymous production data
○ System configs
○ Aggregated test reports
Retention period shown on the slide: 2 weeks
How to use?
● Reporting
● Analysis
● Tests improvement
Reporting
Software Testing is Relentless Learning
EXTENT Talks 2019 Tbilisi: Failover and Recovery Test Automation - Ivan Shamrai
