Best Practices for Scaling an
InfluxEnterprise Cluster
About PayPal
PayPal enables digital and mobile payments on behalf of consumers and
merchants worldwide.
Our platform gives our 254 million active account holders the confidence
to connect and transact in more than 200 markets around the world.
© 2018 PayPal Inc. Confidential and proprietary.
About the presentation
Assumptions:
• Working Knowledge of InfluxDB
What this presentation IS about:
• Scalability
What this presentation IS NOT about:
• Alerting
• How to send money to your family
The Need for a New Host Monitoring Solution
PayPal was looking for a "Host Monitoring" solution to replace antiquated
monitoring systems. Our requirements included:
• A reliable and extensible agent on every system to monitor basic OS
metrics (CPU, disk, memory) as well as third-party applications and
databases
• Time-series database backend for reporting history
• Ability to monitor multiple Docker containers with a single agent
• Smart alerting based on time-series data
Why we chose InfluxData and its TICK stack
• Telegraf provides an extensible, plugin-based architecture for monitoring
all operating systems, applications, and Docker containers
• InfluxDB provides a fast, scalable time-series database
• Chronograf has an intuitive data explorer and query builder
• Kapacitor provides smart alerting capabilities
• Single binaries for each component provide simplicity in deployment
• Influx’s architecture allows for further scalability & customization
• Technical Support
THE JOURNEY TO SCALABILITY
The Journey – Iteration 1
Why this didn’t work well:
• Too many writers with small payloads
• No buffering/retention in case there is
backpressure from the database
• Single Point of Failure (SPOF) condition as
soon as one node fails
[Diagram: ~20k Telegraf agents → InfluxDB data nodes]
The Journey – Iteration 2
What changed:
• Added Telegraf aggregators between
Telegraf agents and cluster
What helped:
• Aggregators reduce the number of writes
by combining multiple tiny payloads into
larger ones for fewer writes to the DB
What didn’t work well:
• No buffering/retention in case DB is
unresponsive
• SPOF upon node failure
[Diagram: ~20k Telegraf agents → aggregators → InfluxDB data nodes]
The Journey – Iteration 3
What changed:
• Added message queues (MQ) as the destination for
aggregator output
• Added "publishers" to consume
messages from the MQ and post them to InfluxDB
• Added data nodes to the cluster
and set the replication factor (RF) to 3
What helped:
• MQs prevent data loss when the DB is
unavailable
• Smart publishers watch for back-pressure
and back off until the cluster is available
• An RF of 3 prevents an immediate SPOF
condition
[Diagram: ~20k Telegraf agents → aggregators → MQ bus → publishers → InfluxDB data nodes]
ABOUT AGGREGATORS
What are Telegraf Aggregators and how do they help?
• Send larger payloads in fewer writes to InfluxDB, which improves overall
performance
• Send to multiple outputs and formats
• Filter, Sanitize & Process at scale quickly
(tagpass, tagdrop, namedrop, namepass)
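The batching idea in the first bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not Telegraf's implementation: the `Batcher` class and its `flush_fn` callback are hypothetical names, and the default of 5000 is borrowed from the `metric_batch_size` value in the sample aggregator configuration in the appendix.

```python
from typing import Callable, List


class Batcher:
    """Collects line-protocol lines and flushes them as one payload (hypothetical helper)."""

    def __init__(self, flush_fn: Callable[[str], None], batch_size: int = 5000) -> None:
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.buffer: List[str] = []

    def add(self, line: str) -> None:
        """Buffer one point; flush automatically once the batch is full."""
        self.buffer.append(line)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send everything buffered so far as a single newline-separated write."""
        if not self.buffer:
            return
        self.flush_fn("\n".join(self.buffer))
        self.buffer = []


# Seven tiny payloads from seven agents become three writes with batch_size=3.
writes: List[str] = []
batcher = Batcher(writes.append, batch_size=3)
for i in range(7):
    batcher.add(f"cpu,host=h{i} usage_idle={90 + i}")
batcher.flush()  # send the remainder
```

In a real aggregator the flush would also be triggered by a timer, so a slow trickle of metrics does not sit in the buffer indefinitely.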
Aggregator Example
[Diagram: Telegraf agents → Telegraf aggregators → InfluxDB]
ABOUT MESSAGE QUEUES
How do Message Queues help?
• Store messages until final destination is available
• Provide a common platform for publishing to just about any other
system
• Use "fan-out" exchanges to copy each message to multiple queues,
replicating data to other Influx clusters or nodes
• Manage TTL for messages
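A toy in-memory model can make the fan-out and TTL bullets concrete. It is only a sketch: `FanoutExchange` and the queue names are hypothetical, time is passed in explicitly to keep the example deterministic, and a real broker does all of this for you.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class FanoutExchange:
    """Toy fan-out exchange: every bound queue gets a copy of each message."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        # queue name -> deque of (expiry time, message)
        self.queues: Dict[str, Deque[Tuple[float, str]]] = {}

    def bind(self, queue_name: str) -> None:
        self.queues.setdefault(queue_name, deque())

    def publish(self, message: str, now: float) -> None:
        """Copy the message into every bound queue with an expiry time."""
        for queue in self.queues.values():
            queue.append((now + self.ttl_seconds, message))

    def consume(self, queue_name: str, now: float) -> Optional[str]:
        """Return the oldest unexpired message, silently dropping expired ones."""
        queue = self.queues[queue_name]
        while queue:
            expires_at, message = queue.popleft()
            if expires_at > now:
                return message
        return None


exchange = FanoutExchange(ttl_seconds=60.0)
exchange.bind("influx_cluster_a")
exchange.bind("influx_cluster_b")
exchange.publish("cpu,host=h1 usage_idle=97", now=0.0)

first = exchange.consume("influx_cluster_a", now=10.0)   # delivered in time
late = exchange.consume("influx_cluster_b", now=120.0)   # TTL exceeded, dropped
```

Each queue holds its own copy, so a slow or unavailable consumer on one cluster does not hold back delivery to the other.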
ABOUT PUBLISHERS
Publishers – The smart way to manage message flows
• Publishers consume from message queues
• Publishers send heartbeats to ensure the database is ready to accept
messages
• Publishers are “smart” in that they will only deliver messages while the
database is up
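The publisher behaviour described above (check health first, deliver only while the database is up, back off otherwise) might look roughly like this. The sketch is not PayPal's publisher: `publish_all`, `backoff_delays`, and the injected `is_healthy`, `write`, and `sleep` callables are all hypothetical. In a real deployment `is_healthy` could, for example, call InfluxDB's `/ping` endpoint, and messages would be acknowledged to the queue only after a successful write.

```python
from typing import Callable, Iterator, List


def backoff_delays(base: float = 1.0, cap: float = 60.0) -> Iterator[float]:
    """Yield exponentially growing delays, capped at `cap` seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= 2


def publish_all(
    messages: List[str],
    is_healthy: Callable[[], bool],
    write: Callable[[str], None],
    sleep: Callable[[float], None],
) -> None:
    """Deliver each message only while the database reports healthy."""
    for message in messages:
        delays = backoff_delays()
        while not is_healthy():
            sleep(next(delays))  # back off; the message stays undelivered
        write(message)


# Simulated run: the database is down for two heartbeats, then recovers.
health = iter([False, False, True, True])
slept: List[float] = []
written: List[str] = []
publish_all(["m1", "m2"], lambda: next(health), written.append, slept.append)
```

Injecting `sleep` keeps the example fast to run; production code would pass `time.sleep`.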
BRINGING IT ALL TOGETHER
Aggregators + 1 Message Queue + 1 Publisher
[Diagram: Telegraf agents → Telegraf aggregators → message queue(s) → Influx publisher → InfluxDB]
Aggregators + 2 Message Queues + 2 Publishers
[Diagram: Telegraf agents → Telegraf aggregators, which feed two paths:
InfluxDB MQ → Influx publisher → InfluxDB, and RDBMS MQ → RDBMS publisher → RDBMS]
OTHER TIPS & BEST PRACTICES
Other Tips & Best Practices for Scaling your InfluxEnterprise Cluster
• Collect only the data you need
• Start with a shorter retention policy
• To find which retention policy is best for you, see:
https://www.influxdata.com/blog/simplifying-influxdb-retention-policy-best-practices/
Avoid being greedy!
• Ensure all nodes in your cluster have the same characteristics such as
RAM, Disk IO/IOPS, Network IO
• A single slower node can actually impair the performance of your whole
cluster
Ensure all cluster nodes have the same characteristics
Ensure all cluster nodes have the same machine characteristics – Slow node example
[Figure: cluster monitoring graphs. Annotations on the graphs:]
• Slower node: 100% memory usage, large heap size, persistent hinted handoff (HH)
• Events: data node crash, then data node removed
• Rest of cluster afterwards: memory usage slightly higher, HH eventually
depleted, heap size balanced on remaining nodes
• Maintain regional clusters and roll up alerting to a single cluster, or
• Maintain regional Telegraf receivers/aggregators and forward to a single
cluster
• Wayfair provided some excellent examples using a combination of
Telegraf receivers in multiple regions that publish to Kafka, then
eventually publish to InfluxDB
• More Info:
https://tech.wayfair.com/2018/04/time-series-data-at-wayfair/
Where possible – Divide & Conquer
• https://www.easyitblog.info/2017/11/14/influxdb-in-iot-world-aws-part-2/ compares
query and write execution times on different AWS instances for two
different types of load.
• Summary: CPU power is important for write throughput.
RAM and disk speed (SSD) are important for query throughput.
Scale Up! (Use beefier hardware)
Closing Remarks
“If you can’t measure it, you can’t manage it.”
Thank You!
APPENDIX
Other Tips & Best Practices for Scaling your InfluxEnterprise Cluster
• Implement "MOM." Check multiple metrics to
ensure your cluster is healthy and will be able to
scale in the future:
• Hinted Handoff
• CPU/Mem/Disk
• WAL/HH directories
• IOPS_In_Progress
• Etc.
• Send cluster stats & data to a different instance or
cluster
• Avoid deleting measurements. This is very
expensive and taxing to the system (uncompress,
remove data, re-compress over multiple time-based
shards).
• Use Influx’s native line protocol as much as possible.
Perform data transformations only when necessary
(such as to RDBMS)
• If you can't revive a dead data node quickly, it's best
to remove it from the cluster. The other data nodes then stop
retrying writes to that node, which frees up some
cycles and prevents HH buildup.
• Enable TSI
• Adjust Fsync-delay for slower disks
• Adjust WAL setting
• Set the compaction interval so that compaction runs only once you
no longer expect data for that time range, so re-compaction
doesn't have to recur
• Take advantage of the jitter feature in Telegraf to
spread out delivery of metrics to your Influx cluster
• Ensure VMs are on different hypervisors (HVs)
• Use flash (SSD) storage for data nodes
• Use a reliable deployment tool such as Puppet or
Ansible
• Have a "nanny" such as systemd (in OEL 7+) or
supervisord to ensure your services stay running
Miscellaneous Tips
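To see why jitter helps, consider a rough simulation. Telegraf's agent section exposes `collection_jitter` and `flush_jitter` settings; the sketch below does not use Telegraf at all and only models the effect, assuming each agent delays its flush by a uniformly random offset. The function name `flush_offsets`, the 10-second jitter window, and the fixed seed are illustrative choices.

```python
import random
from collections import Counter
from typing import List


def flush_offsets(agents: int, jitter_seconds: float, seed: int = 0) -> List[float]:
    """Return one random flush offset between 0 and jitter_seconds per agent."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, jitter_seconds) for _ in range(agents)]


# Without jitter, all ~20k agents would flush in the same instant.
offsets = flush_offsets(agents=20_000, jitter_seconds=10.0)

# Count how many agents flush within each one-second bucket.
per_second = Counter(int(offset) for offset in offsets)
busiest = max(per_second.values())
```

With the flushes spread over the window, the busiest second carries only a fraction of the 20k agents instead of all of them at once.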
Sample Aggregator Configuration
# AMQP Output for Influx Publisher
[[outputs.amqp]]
brokers = ["amqp://AMQP_UID:AMQP_PWD@AMQ_VIP:5672"]
max_messages = 10000
exchange = "influxdb_exchange"
# AMQP Output (in JSON format)
[[outputs.amqp]]
brokers = ["amqp://AMQP_UID:AMQP_PWD@AMQ_VIP:5672"]
max_messages = 10000
exchange = "json_exchange_name"
data_format = "json"
json_timestamp_units = "1s"
# Don't send metrics for any container measurement
namedrop = ["container*"]
[agent]
metric_batch_size = 5000
# Influx HTTP write listener
[[inputs.http_listener]]
## Address and port to host HTTP listener on
service_address = ":8086"
# Don't collect metrics where
# host tag contains -HOSTNAME-
[inputs.http_listener.tagdrop]
host = ["*-HOSTNAME-*"]
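The effect of the `namedrop` and `tagdrop` patterns in this sample can be approximated in a few lines. This is a sketch only: it uses Python's `fnmatchcase` as a stand-in for Telegraf's glob matching, the function `dropped` is a hypothetical name, and it merges the output-side `namedrop` and input-side `tagdrop` into one check for brevity.

```python
from fnmatch import fnmatchcase
from typing import Dict, List


def dropped(
    name: str,
    tags: Dict[str, str],
    namedrop: List[str],
    tagdrop: Dict[str, List[str]],
) -> bool:
    """Return True if a metric matches any namedrop or tagdrop glob."""
    if any(fnmatchcase(name, pattern) for pattern in namedrop):
        return True
    for key, patterns in tagdrop.items():
        value = tags.get(key)
        if value is not None and any(fnmatchcase(value, p) for p in patterns):
            return True
    return False


# Patterns copied from the sample configuration.
NAMEDROP = ["container*"]
TAGDROP = {"host": ["*-HOSTNAME-*"]}

container_metric = dropped("container_cpu", {"host": "web1"}, NAMEDROP, TAGDROP)
placeholder_host = dropped("cpu", {"host": "a-HOSTNAME-b"}, NAMEDROP, TAGDROP)
normal_metric = dropped("cpu", {"host": "web1"}, NAMEDROP, TAGDROP)
```

Filtering at the aggregator means unwanted series never reach the message queue or the database.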
Other Tips & Best Practices for Scaling your InfluxEnterprise Cluster
/etc/security/limits.d/<filename>.conf:
uid hard nproc 16384
uid soft nproc 16384
uid hard nofile 500000
uid soft nofile 500000
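A quick way to confirm that limits like these actually apply to a running process is to read them back from inside it. The sketch below assumes a Linux host and uses Python's standard `resource` module; `limit_shortfalls` is a hypothetical helper, and the target numbers are the ones listed above.

```python
import resource
from typing import Dict, Tuple


def limit_shortfalls(targets: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return {limit_name: (current_soft, wanted)} for soft limits below target."""
    shortfalls = {}
    for name, wanted in targets.items():
        soft, _hard = resource.getrlimit(getattr(resource, name))
        # An unlimited soft limit always satisfies the target.
        if soft != resource.RLIM_INFINITY and soft < wanted:
            shortfalls[name] = (soft, wanted)
    return shortfalls


report = limit_shortfalls({"RLIMIT_NOFILE": 500000, "RLIMIT_NPROC": 16384})
for name, (soft, wanted) in report.items():
    print(f"{name}: soft limit {soft} is below target {wanted}")
```

Run it as the same user and through the same service manager as the aggregator or data node, since limits are inherited per process.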
Check your system's limits configuration. Ensure you have enough headroom for network connections and file handles.
Check/Adjust System Limits – Especially for Aggregator & Data Nodes
/etc/sysctl.d/<filename>.conf
net.core.somaxconn = 1024
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_orphan_retries = 1
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_max_orphans = 8192
net.ipv4.ip_local_port_range = 18000 65535
net.ipv4.tcp_rmem = 4096 25165824 25165824
net.ipv4.tcp_wmem = 4096 65536 25165824
net.core.optmem_max = 25165824
net.ipv4.tcp_congestion_control = htcp
net.core.default_qdisc = fq
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_max_syn_backlog = 4096
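Settings like these tend to drift between nodes, so it can help to compare the file against the live kernel values. The following is a minimal sketch assuming a Linux host where sysctl keys map to paths under `/proc/sys`; `parse_sysctl`, `proc_path`, and `drift` are hypothetical helper names, and the simple dot-to-slash mapping does not handle keys that contain interface names with dots.

```python
from pathlib import Path
from typing import Dict


def parse_sysctl(text: str) -> Dict[str, str]:
    """Parse 'key = value' lines, skipping blanks and comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Normalise whitespace so '4096  65536' and '4096\t65536' compare equal.
        settings[key.strip()] = " ".join(value.split())
    return settings


def proc_path(key: str) -> Path:
    """Map a sysctl key such as net.core.somaxconn to its /proc/sys path."""
    return Path("/proc/sys") / key.replace(".", "/")


def drift(wanted: Dict[str, str]) -> Dict[str, str]:
    """Return {key: live_value} for keys whose live value differs from wanted."""
    differences = {}
    for key, value in wanted.items():
        path = proc_path(key)
        if path.exists():
            live = " ".join(path.read_text().split())
            if live != value:
                differences[key] = live
    return differences


sample = "net.core.somaxconn = 1024\nnet.ipv4.tcp_rmem = 4096 25165824 25165824\n"
wanted = parse_sysctl(sample)
```

Calling `drift(wanted)` on each aggregator and data node then lists any key whose running value differs from the file.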