Troubleshooting Apache® Ignite™
Customer Solutions, GridGain
Stan Lukyanov
2019 © GridGain Systems

GridGain and Apache Ignite
[Diagram: GridGain In-Memory Computing Platform. Components shown: In-Memory Data Grid; In-Memory Database; Streaming Analytics; Continuous Learning Framework; Segmentation Protection; Data Center Replication; Monitoring & Management; Enterprise Security; Rolling Upgrades; Point-in-Time Recovery; Heterogeneous Recovery; Full, Incremental, Continuous Backups; Network Backups]

Apache Ignite Support – Faster Time to Reliable Ignite
• Get up and running faster with a 2-hour initial consultation
• Ensure fast, reliable Ignite with unlimited 9x5 global support
– Unlimited web/e-mail support
– Identify bugs, workarounds
– Troubleshoot performance, reliability issues
https://www.gridgain.com/products/services/support/support-apache-ignite

Agenda
• Monitoring
– Logging
– JMX
– GridGain Web Console
• Troubleshooting
– Network
– Storage
– Performance
• Checklist

Monitoring
Set it up before something bad happens!

Logging
What you can do with logs
• Manually check node state
• Identify issues with cluster configuration
• Add automatic parsing to report issues on the fly
– With custom or third-party tools
• Provide them to GridGain Support experts
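
Automatic parsing can start very small. The sketch below assumes log records in the `[timestamp][LEVEL][thread][logger] message` layout seen in the Ignite log excerpts in this deck; the regex and the function names `parse_line` and `count_problems` are illustrative and not part of Ignite.

```python
import re
from collections import Counter

# Assumed record layout: [timestamp][LEVEL][thread][logger] message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]*)\]\[(?P<level>[A-Z]+)\s*\]"
    r"\[(?P<thread>[^\]]*)\]\[(?P<logger>[^\]]*)\]\s*(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict of record fields, or None for continuation lines (e.g. stack traces)."""
    m = LINE_RE.match(line.rstrip("\n"))
    return m.groupdict() if m else None


def count_problems(lines, levels=("WARN", "ERROR")):
    """Count records per (level, logger) for the given severity levels."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and rec["level"] in levels:
            counts[(rec["level"], rec["logger"])] += 1
    return counts
```

Feeding it an open log file (`count_problems(open(path))`) gives a quick per-component tally that a monitoring job could alert on.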

Logging
Configuring GridGain logs
• https://apacheignite.readme.io/docs/logging
• Use the log4j 2.x integration
– Other options: j.u.logging, log4j 1.x, slf4j, custom integration
• Start Ignite in verbose mode
– ignite.sh -v
– java -DIGNITE_QUIET=false -cp ...

Logging
Quiet log [screenshot not preserved]

Logging
Configuring GC logs
• https://apacheignite.readme.io/docs/jvm-and-system-tuning
• Crucial for troubleshooting many issues
• To configure (JDK 8 flag syntax)
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M
-Xloggc:/path/to/gc/logs/log.txt

Logging
How to manage logs
• Choose a location for the log files
– The default location is ${IGNITE_HOME}/work/log/ignite.log
– Use a local disk with enough space (2 GB+)
– Don’t use /tmp!
– It is a good idea to store GridGain, application and GC logs together
• Archive old log files periodically to save storage space
• Try to keep logs covering the cluster’s entire current uptime

Logging
Run-time configuration changes
• Helpful when you need to debug a running deployment
• Easy to do with log4j 2.x
– Edit the log4j config file directly
• https://logging.apache.org/log4j/2.x/manual/configuration.html#AutomaticReconfiguration
– Use JMX – log4j has its own bean
• https://logging.apache.org/log4j/2.x/manual/jmx.html

JMX
What you can do with JMX
• Check the state of various grid subsystems: storage, thread pools, etc.
• Monitor metrics changing over time
• Add automatic warnings on important metrics
– With custom or third-party tools

JMX
How to leverage JMX beans
• Standard way to monitor Java applications
– A lot of tools on the market
– JConsole and VisualVM are bundled with the JDK

GridGain Metrics
Name | Description | JMX Query | Default
Cluster metrics | Basic node information | org.apache:clsLdr=*,group=Kernal,name=ClusterMetricsMXBeanImpl and org.apache:clsLdr=*,group=Kernal,name=ClusterLocalNodeMetricsMXBeanImpl | Enabled
Data region metrics | Memory information | org.apache:clsLdr=*,group=DataRegionMetrics,name=<region_name> | Disabled
Data storage metrics | Persistent storage information | org.apache:clsLdr=*,group="Persistent Store",name=DataStorageMetrics | Disabled
Cache metrics | Cache statistics | org.apache:clsLdr=*,group=<cache_name>,name="org.apache.ignite.internal.processors.cache.CacheClusterMetricsMXBeanImpl" and org.apache:clsLdr=*,group=<cache_name>,name="org.apache.ignite.internal.processors.cache.CacheLocalMetricsMXBeanImpl" | Disabled
Thread pool metrics | Thread pools information | org.apache:clsLdr=*,group="Thread Pools",name=<thread_pool_name> | Enabled
• https://apacheignite.readme.io/docs/cluster-groups#section-cluster-group-metrics
• https://apacheignite.readme.io/docs/memory-metrics
• https://apacheignite.readme.io/docs/cache-metrics

GridGain Tools
Web Console
• Basic version in Ignite
• Enhanced version in GridGain
• All-in-one solution
• Web-based
• The richest functionality
Visor GUI
• Available in GridGain
• All-in-one solution
• Runs on a user’s PC
Command line tools
• Available in Ignite
• Multiple scripts
– Visor CMD
– control.sh
– SQLLine
GridGain Web Console
19
What can be done
• Manage multiple GridGain clusters through a web interface
– Monitor metrics changing over time
– Watch logs and thread dumps of the nodes
– Execute and monitor running queries
• Generate cluster configurations and RDBMS integrations
• Available online to try at https://console.gridgain.com/
• On-premise installation for production using Docker or bare metal
– https://apacheignite-tools.readme.io/docs/docker-deployment
Troubleshooting

Network troubleshooting
• Nodes not joining the cluster
• Nodes joining a wrong cluster
• Nodes taking a long time to connect
• Node is kicked out from the cluster

Nodes not joining the cluster
Symptom
• Started nodes don’t join a single cluster – each forms a cluster of its own instead
• Topology snapshot shows “ver=1, <...> servers=1” on both nodes
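
This symptom is easy to check mechanically. The sketch below assumes each node logs a "Topology snapshot" line containing `ver=<n>` and `servers=<n>` pairs, as quoted on the slide; the function names and the way the latest line per node is collected are hypothetical.

```python
import re


def parse_topology(line):
    """Extract ver and servers from a 'Topology snapshot' log line, or return None."""
    if "Topology snapshot" not in line:
        return None
    ver = re.search(r"\bver=(\d+)", line)
    servers = re.search(r"\bservers=(\d+)", line)
    if not (ver and servers):
        return None
    return {"ver": int(ver.group(1)), "servers": int(servers.group(1))}


def isolated_nodes(last_snapshot_per_node, expected_servers):
    """Return names of nodes whose latest snapshot shows fewer servers than expected."""
    isolated = []
    for node, line in last_snapshot_per_node.items():
        topo = parse_topology(line)
        if topo is not None and topo["servers"] < expected_servers:
            isolated.append(node)
    return sorted(isolated)
```

If every node is reported as isolated, discovery is the first thing to look at.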

Nodes not joining the cluster
Possible cause and solution
• Connection issues, firewall, etc.
– Check connectivity via ping, check firewall settings
– Make sure TCP ports are open: 47500-47509, 47100-47109, 11211, 10800
• IP Finder configuration
– Use TcpDiscoveryVmIpFinder when you know all hosts in advance
• Make sure to list all IPs and ports
– A lot of other options: Kubernetes, cloud, etc.
• https://apacheignite.readme.io/docs/kubernetes-deployment
• https://apacheignite.readme.io/docs/tcpip-discovery
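
Ping only proves that the host is up, so a TCP-level probe of the ports listed above is a useful second step. This is a minimal sketch using only the standard library; `reachable_ports` and `DEFAULT_PORTS` are illustrative names, and the one-second timeout is an arbitrary choice.

```python
import socket

# Port ranges and ports listed on the slide
DEFAULT_PORTS = list(range(47500, 47510)) + list(range(47100, 47110)) + [11211, 10800]


def reachable_ports(host, ports, timeout=1.0):
    """Return the subset of ports on host that accept a TCP connection."""
    open_ports = []
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                open_ports.append(port)
        except OSError:
            pass  # refused, filtered by a firewall, or timed out
    return open_ports
```

A port only shows up as reachable when a node is actually listening on it, so run the probe from another cluster host while the target node is started, and expect a few ports from the list rather than all of them.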

Nodes joining a wrong cluster
Symptom
• A node started in a seemingly clean environment connects to some unknown cluster
• Topology snapshot is “ver=2, servers=2”

Nodes joining a wrong cluster
Possible cause and solution
• Multicast (TcpDiscoveryMulticastIpFinder) is used
– Only use it in closed subnets, or for a “Hello, world!”
• Wrong IPs in the configuration
– Check and fix IP finder configuration – see previous slide

Nodes taking a long time to connect
Symptom
• Cluster is taking a long time to start
– Or the first node is taking a long time to start
• After the nodes have joined the cluster, performance is good

Nodes taking a long time to connect
Possible cause and solution
• Too many addresses in the IP finder
– Ignite will try to connect to each known IP
– Only list the addresses and ports you use
• IgniteConfiguration.failureDetectionTimeout is too high
– Basic timeout for most network operations
– Used for establishing a connection
– Reduce failureDetectionTimeout to scan through the IP list faster

Node is kicked out from the cluster
Symptom
• Cluster nodes report that one of them has failed
[...][WARN ][disco-event-worker-#42][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=13169858-92db-48a3-bc4b-db369cab6457, addrs=[…], sockAddrs=[…], discPort=47501, order=2, intOrder=2, lastExchangeTime=1550680047403, loc=false, ver=2.7.2#20190206-sha1:5f8f5488, isClient=false]
• “Local node SEGMENTED” messages on the failed node
[...][WARN ][disco-event-worker-#42][GridDiscoveryManager] Local node SEGMENTED: TcpDiscoveryNode [id=13169858-92db-48a3-bc4b-db369cab6457, addrs=[…], sockAddrs=[…], discPort=47501, order=2, intOrder=2, lastExchangeTime=1550680109573, loc=true, ver=2.7.2#20190206-sha1:5f8f5488, isClient=false]

Node is kicked out from the cluster
Possible cause and solution
• Likely a GC pause
• To confirm
– Check for “Possible too long JVM pause” messages in the logs
– Analyze GC logs around the time of the issue
• https://gceasy.io/ can help
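
The log check can be scripted. The sketch below assumes the message has the form "Possible too long JVM pause: N milliseconds"; only the leading phrase is quoted on the slide, so the numeric suffix in the pattern is an assumption to verify against your own logs.

```python
import re

# Assumed message format; only the leading phrase appears on the slide
PAUSE_RE = re.compile(r"Possible too long JVM pause: (\d+) milliseconds")


def long_pauses(lines, threshold_ms=0):
    """Yield (line_number, pause_ms) for pause messages at or above threshold_ms."""
    for n, line in enumerate(lines, start=1):
        m = PAUSE_RE.search(line)
        if m and int(m.group(1)) >= threshold_ms:
            yield n, int(m.group(1))
```

Pauses that approach the configured failureDetectionTimeout are the ones most likely to get a node dropped, so that value is a natural threshold.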

Node is kicked out from the cluster
Possible cause and solution
• To handle GC issues
– Increase heap: -Xmx16g
– Try G1 with a latency target: -XX:+UseG1GC -XX:MaxGCPauseMillis=200
– Reduce heap pressure
• Use smaller SQL queries
▪ More about SQL later
• Use IgniteCache.withKeepBinary() to avoid deserialization
▪ https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/IgniteCache.html#withKeepBinary--
• Use on-heap caching with CacheConfiguration.copyOnRead=false
▪ https://apacheignite.readme.io/docs/memory-configuration#section-on-heap-caching
– https://apacheignite.readme.io/docs/jvm-and-system-tuning
• Workaround: increase failureDetectionTimeout to tolerate a longer inactivity period

Storage troubleshooting
• “Out of memory” errors
• Lost data after node restart
• Incorrect/partial data returned by SQL
• Some nodes don’t store data

Out of memory
Symptom
• Three kinds of “out of memory” conditions
– Java’s OutOfMemoryError
– IgniteOutOfMemoryException
– The OS killing Ignite’s JVM

Out of memory
Possible cause and solution: OutOfMemoryError
• Heap is too small
– Increase -Xmx
• Large SQL queries are running
– Use lazy queries
• https://apacheignite-sql.readme.io/docs/performance-and-debugging#section-result-set-lazy-loading
– Avoid running many large queries concurrently
– Split the query to reduce the data set size
• Before:
▪ SELECT * FROM PERSON ORDER BY AGE
• After:
▪ SELECT * FROM PERSON WHERE AGE < 40 ORDER BY AGE
▪ SELECT * FROM PERSON WHERE AGE >= 40 ORDER BY AGE
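
The split generalizes to any number of ranges. The sketch below shows the idea with Python's built-in SQLite as a stand-in database (not Ignite); `fetch_in_ranges` is a hypothetical helper, and the PERSON table with an AGE column is taken from the slide.

```python
import sqlite3


def fetch_in_ranges(conn, boundaries):
    """Run one query per AGE range and yield rows range by range.

    boundaries=[40] produces AGE < 40 and AGE >= 40, as on the slide.
    """
    lows = [None] + list(boundaries)
    highs = list(boundaries) + [None]
    for low, high in zip(lows, highs):
        clauses, params = [], []
        if low is not None:
            clauses.append("AGE >= ?")
            params.append(low)
        if high is not None:
            clauses.append("AGE < ?")
            params.append(high)
        where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
        sql = "SELECT * FROM PERSON" + where + " ORDER BY AGE"
        # Each sub-query sorts and returns only its own slice of the table
        yield from conn.execute(sql, params)
```

Because the ranges are visited in ascending order, the concatenated output is still sorted by AGE. Rows whose AGE is NULL match none of the ranges, so they need a separate query if they matter.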

Out of memory
Possible cause and solution: IgniteOutOfMemoryException
• Data region is too small
IgniteOutOfMemoryException: Out of memory in data region [name=default, initSize=256,0 MiB, maxSize=476,8 MiB, persistenceEnabled=false] Try the following:
^-- Increase maximum off-heap memory size (DataRegionConfiguration.maxSize)
^-- Enable Ignite persistence (DataRegionConfiguration.persistenceEnabled)
^-- Enable eviction or expiration policies
– Increase DataRegionConfiguration.maxSize
• https://apacheignite.readme.io/docs/capacity-planning
– Use Native Persistence
• https://apacheignite.readme.io/docs/distributed-persistent-store
– Use DataRegionConfiguration.pageEvictionMode=RANDOM_2_LRU
• https://apacheignite.readme.io/docs/evictions
– Use an expiry policy
• https://apacheignite.readme.io/docs/expiry-policies

Out of memory
Possible cause and solution: OOM killer
• Total size of the processes is greater than RAM
– Check the kernel log for the message
Out of memory: Kill process <PID> (java) score <SCORE> or sacrifice child
– Make sure that the total size of the processes is within bounds
– Disable overcommit to get more graceful, in-process errors
• sysctl -w vm.overcommit_memory=2
• https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
• May affect other applications, use carefully
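
Scanning saved kernel log output for that message can be automated. The sketch below matches the wording quoted on the slide; other kernel versions may phrase the message differently, so treat the pattern as an assumption, and `oom_kills` as an illustrative name.

```python
import re

# Message format as quoted on the slide
OOM_RE = re.compile(r"Out of memory: Kill process (\d+) \(([^)]+)\) score (\d+)")


def oom_kills(kernel_log_lines, process_name="java"):
    """Return (pid, score) for each OOM-killer victim with the given process name."""
    hits = []
    for line in kernel_log_lines:
        m = OOM_RE.search(line)
        if m and m.group(2) == process_name:
            hits.append((int(m.group(1)), int(m.group(3))))
    return hits
```

Matching the reported PID against the PID of the Ignite JVM confirms whether the node was killed by the OS rather than failing on its own.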

Lost data after node restart
Symptom
• Some or all data was lost after one or several node restarts

Lost data after node restart
Possible cause and solution
• In-memory cache with zero backups
– Taking down a node will always lose data
• In-memory cache with one or more backups
– Don’t take down more nodes simultaneously than there are backups
– Wait for rebalancing to finish after bringing a node back online (use Web Console for that)

Incorrect/partial data returned by SQL
Symptom
• An SQL query with JOIN returns less data than expected
CREATE TABLE ORGANIZATION (ID INT PRIMARY KEY, NAME VARCHAR);
CREATE TABLE PERSON (ID INT, ORGID INT, NAME VARCHAR, PRIMARY KEY (ID, ORGID));
SELECT * FROM PERSON P JOIN ORGANIZATION O ON P.ORGID = O.ID;

Incorrect/partial data returned by SQL
Possible cause and solution
• Data isn’t collocated
– https://apacheignite.readme.io/docs/affinity-collocation
– Use affinity key to keep related data together
CREATE TABLE ORGANIZATION (ID INT PRIMARY KEY, NAME VARCHAR);
CREATE TABLE PERSON (ID INT, ORGID INT, NAME VARCHAR, PRIMARY KEY (ID, ORGID))
WITH "AFFINITY_KEY=ORGID";
SELECT * FROM PERSON P JOIN ORGANIZATION O ON P.ORGID = O.ID;
– Use replicated tables – they’re collocated with all others
CREATE TABLE ORGANIZATION (ID INT PRIMARY KEY, NAME VARCHAR)
WITH "TEMPLATE=REPLICATED";
CREATE TABLE PERSON (ID INT, ORGID INT, NAME VARCHAR, PRIMARY KEY (ID, ORGID));
SELECT * FROM PERSON P JOIN ORGANIZATION O ON P.ORGID = O.ID;
– Use distributedJoins=true to do JOINs without collocation (this is costly!)
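
Why a non-collocated join loses rows can be seen in a toy model where each node joins only the rows it stores. The placement rule below (`key % nodes`) is a deliberate simplification and not Ignite's real affinity function; the function name and the data are made up for illustration.

```python
def local_join_rows(persons, orgs, nodes, person_key):
    """Count PERSON-ORGANIZATION matches found when every node joins only its local rows.

    persons: list of (id, orgid) tuples; orgs: list of organization ids.
    person_key: function returning the value that decides a person's node.
    Toy placement rule: node = key % nodes (NOT Ignite's affinity function).
    """
    org_node = {org_id: org_id % nodes for org_id in orgs}
    found = 0
    for person in persons:
        _, orgid = person
        person_node = person_key(person) % nodes
        # A match is only visible if both rows live on the same node
        if orgid in org_node and org_node[orgid] == person_node:
            found += 1
    return found


persons = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 3), (6, 4)]
orgs = [1, 2, 3, 4]
by_id = local_join_rows(persons, orgs, 2, lambda p: p[0])     # placed by PERSON.ID
by_orgid = local_join_rows(persons, orgs, 2, lambda p: p[1])  # placed by ORGID (affinity key)
```

With two nodes, placing persons by ID finds only part of the matches, while placing them by ORGID finds all six, which is the effect AFFINITY_KEY=ORGID is meant to achieve.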

Some nodes don’t store data
Symptom
• Native persistence is enabled
• Data is only stored on one node or a subset of nodes
• New nodes don’t store data

Some nodes don’t store data
Possible cause and solution
• Baseline topology doesn’t include some nodes
– https://apacheignite.readme.io/docs/baseline-topology
– Update the baseline after every long-term topology change (use Web Console for that)

Performance troubleshooting
• Throughput of persistent cache updates drops to zero
• Low throughput of SQL access

Throughput of cache updates drops to zero
Symptom
• Native persistence is enabled
• Write performance is good but there are periodic intervals of 0 ops/sec

Throughput of cache updates drops to zero
Possible cause and solution
• Updates in RAM happen faster than on disk – the disk needs to catch up
– Enable write throttling
• DataStorageConfiguration.writeThrottlingEnabled=true
• Sacrifice peak performance for stable latency and throughput
– Increase checkpoint page buffer size
• Allow more pending updates in RAM
• https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood

Low throughput of SQL access
Symptom
• SQL reads using indexed fields are slow
– Even for simple queries like SELECT * FROM FOO WHERE ID = 123
• Writes to an SQL-enabled cache are slow
– Even via key-value API

Low throughput of SQL access
Possible cause and solution
• A lot of possible causes – performance tuning is hard!
– https://apacheignite-sql.readme.io/docs/performance-and-debugging
• A common issue – indexes not fitting into maximum inline size
– Check for “Indexed columns of a row cannot be fully inlined into index” warnings
– Do what the warning says
[2019-02-19T23:36:26,848][WARN ][main][H2TreeIndex] <CacheQueryExamplePersons> Indexed columns of a row cannot be fully inlined into index what may lead to slowdown due to additional data page reads, increase index inline size if needed (use INLINE_SIZE option for CREATE INDEX command, QuerySqlField.inlineSize for annotated classes, or QueryIndex.inlineSize for explicit QueryEntity configuration) [cacheName=PERSON, tableName=CacheQueryExamplePersons, idxName=PERSON_FIRSTNAME_IDX, idxCols=(FIRSTNAME, _KEY), idxType=SECONDARY, curSize=10, recommendedInlineSize=71]
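
Collecting these warnings across a whole log gives a to-do list of indexes to resize. The sketch below relies on the `tableName=`, `idxName=` and `recommendedInlineSize=` fields visible in the warning above and assumes each warning sits on one physical log line; `inline_size_hints` is an illustrative name.

```python
import re

WARN_MARK = "cannot be fully inlined into index"


def inline_size_hints(lines):
    """Return {(table, index): recommended_inline_size} from index inline size warnings."""
    hints = {}
    for line in lines:
        if WARN_MARK not in line:
            continue
        table = re.search(r"tableName=([^,\]]+)", line)
        index = re.search(r"idxName=([^,\]]+)", line)
        size = re.search(r"recommendedInlineSize=(\d+)", line)
        if table and index and size:
            hints[(table.group(1), index.group(1))] = int(size.group(1))
    return hints
```

Each entry then maps to one of the options the warning itself names: INLINE_SIZE for CREATE INDEX, QuerySqlField.inlineSize, or QueryIndex.inlineSize.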

Checklist
Be prepared
• Set up monitoring before something happens, not after!
– Make sure logs are safely stored
– GridGain Web Console is an easy and effective way to monitor
• Properly configure the cluster
– Implement the suggestions from this presentation before you encounter the issues they help with
• Know the common problems and how to react
– Logs often provide solutions

Checklist
Avoiding common problems
• Network-related issues are often caused by IP configuration
– Configure IP finder – TcpDiscoveryVmIpFinder is usually a good fit
• GC pauses cause performance and stability issues
– Use G1 GC, set a fitting heap size
– Reduce GC pressure by using lazy SQL, withKeepBinary() and on-heap cache
• Insufficient memory will result in a crash
– Plan storage capacity in advance
– Assess required heap size during testing
– Account for other processes in the system

Checklist
Avoiding common problems
• Implement cluster administration processes
– For in-memory: wait for rebalance when starting or stopping nodes
– For persistence: update baseline for long-term topology changes
• Write performance may be unstable when the update rate exceeds the disk speed
– Enable write throttling
• Consider performance suggestions reported in the log
– The index inline size suggestion is an example
Questions?

Apache Ignite Resources
• Apache Ignite documentation
– https://apacheignite.readme.io/docs/
• Apache Ignite community resources
– user@ignite.apache.org – the mailing list
– https://ignite.apache.org/community/resources.html – other resources and instructions
– http://apache-ignite-users.70518.x6.nabble.com/ – forum and archive
– https://stackoverflow.com/questions/tagged/ignite – StackOverflow questions
• Contact me!
– slukyanov@gridgain.com
– stanlukyanov@gmail.com

GridGain Resources
• White Papers
– Visit https://www.gridgain.com/resources/papers
• Webinars
– Visit https://www.gridgain.com/resources/webinars
• Videos
– Visit https://www.gridgain.com/resources/videos
• Free 30-Day Ultimate, Enterprise or Professional Edition Trial
– Visit https://www.gridgain.com/resources/download

More Related Content

What's hot (20)

PPTX
Introduction of netty
Bing Luo
 
PDF
Apache Spark Introduction
sudhakara st
 
PPTX
Introduction to Storm
Chandler Huang
 
PPTX
Key-Value NoSQL Database
Heman Hosainpana
 
PPTX
Design cube in Apache Kylin
Yang Li
 
PPTX
Apache Kylin on HBase: Extreme OLAP engine for big data
Shi Shao Feng
 
PDF
Making Data Timelier and More Reliable with Lakehouse Technology
Matei Zaharia
 
PPTX
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Simplilearn
 
PDF
Data Lakes - The Key to a Scalable Data Architecture
Zaloni
 
PDF
Why Use an Oracle Database?
Markus Michalewicz
 
PPTX
What is NoSQL and CAP Theorem
Rahul Jain
 
PDF
Log Structured Merge Tree
University of California, Santa Cruz
 
PDF
Apache Kylin - Balance Between Space and Time
DataWorks Summit
 
PPT
Oracle GoldenGate
oracleonthebrain
 
PDF
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
PDF
How to Build a Scylla Database Cluster that Fits Your Needs
ScyllaDB
 
PPTX
Hive Does ACID
DataWorks Summit
 
PPTX
Hadoop technology
tipanagiriharika
 
PPTX
Snowflake + Power BI: Cloud Analytics for Everyone
Angel Abundez
 
PPTX
HBase in Practice
larsgeorge
 
Introduction of netty
Bing Luo
 
Apache Spark Introduction
sudhakara st
 
Introduction to Storm
Chandler Huang
 
Key-Value NoSQL Database
Heman Hosainpana
 
Design cube in Apache Kylin
Yang Li
 
Apache Kylin on HBase: Extreme OLAP engine for big data
Shi Shao Feng
 
Making Data Timelier and More Reliable with Lakehouse Technology
Matei Zaharia
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Simplilearn
 
Data Lakes - The Key to a Scalable Data Architecture
Zaloni
 
Why Use an Oracle Database?
Markus Michalewicz
 
What is NoSQL and CAP Theorem
Rahul Jain
 
Log Structured Merge Tree
University of California, Santa Cruz
 
Apache Kylin - Balance Between Space and Time
DataWorks Summit
 
Oracle GoldenGate
oracleonthebrain
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
How to Build a Scylla Database Cluster that Fits Your Needs
ScyllaDB
 
Hive Does ACID
DataWorks Summit
 
Hadoop technology
tipanagiriharika
 
Snowflake + Power BI: Cloud Analytics for Everyone
Angel Abundez
 
HBase in Practice
larsgeorge
 

Similar to Troubleshooting Apache® Ignite™ (20)

PPTX
In-Memory Computing Essentials for Software Engineers
Denis Magda
 
PPTX
Container and Test Automation Management Practices in TrendMicro
Jen-Chieh Ko
 
PPTX
On Cloud Nine: How to be happy migrating your in-memory computing platform to...
Stephen Darlington
 
PDF
Network Automation Journey, A systems engineer NetOps perspective
Walid Shaari
 
PDF
Kubernetes as data platform
Lars Albertsson
 
PDF
DevOpsDaysRiga 2018: Eric Skoglund, Lars Albertsson - Kubernetes as data plat...
DevOpsDays Riga
 
PDF
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
NETWAYS
 
PPTX
Viavi_TeraVM Core Emulator.pptx
mani723
 
PPTX
Zendcon scaling magento
Mathew Beane
 
PPTX
ROI for IP Address Management (IPAM) Solutions
SolarWinds
 
PPSX
November 2013 HUG: Real-time analytics with in-memory grid
Yahoo Developer Network
 
PDF
CFD on Power
Ganesan Narayanasamy
 
PDF
QRadar_CEddfdfdsfdfdfdfdfdfdfdfdfdfdff.pdf
mindhackers161
 
PPTX
EVOLVE'15 | Maximize | Gary Gamitian | Informatica
Evolve The Adobe Digital Marketing Community
 
PDF
Digdag Updates 2020 July
You Yamagata
 
PDF
Distributed deep learning reference architecture v3.2l
Ganesan Narayanasamy
 
PPTX
Securing Hadoop @eBay
DataWorks Summit
 
PDF
Kubecon seattle 2018 workshop slides
Weaveworks
 
PDF
Grails 4: Upgrade your Game!
Zachary Klein
 
PDF
MuleSoft Manchester Meetup #2 slides 29th October 2019
Ieva Navickaite
 
In-Memory Computing Essentials for Software Engineers
Denis Magda
 
Container and Test Automation Management Practices in TrendMicro
Jen-Chieh Ko
 
On Cloud Nine: How to be happy migrating your in-memory computing platform to...
Stephen Darlington
 
Network Automation Journey, A systems engineer NetOps perspective
Walid Shaari
 
Kubernetes as data platform
Lars Albertsson
 
DevOpsDaysRiga 2018: Eric Skoglund, Lars Albertsson - Kubernetes as data plat...
DevOpsDays Riga
 
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
NETWAYS
 
Viavi_TeraVM Core Emulator.pptx
mani723
 
Zendcon scaling magento
Mathew Beane
 
ROI for IP Address Management (IPAM) Solutions
SolarWinds
 
November 2013 HUG: Real-time analytics with in-memory grid
Yahoo Developer Network
 
CFD on Power
Ganesan Narayanasamy
 
QRadar_CEddfdfdsfdfdfdfdfdfdfdfdfdfdff.pdf
mindhackers161
 
EVOLVE'15 | Maximize | Gary Gamitian | Informatica
Evolve The Adobe Digital Marketing Community
 
Digdag Updates 2020 July
You Yamagata
 
Distributed deep learning reference architecture v3.2l
Ganesan Narayanasamy
 
Securing Hadoop @eBay
DataWorks Summit
 
Kubecon seattle 2018 workshop slides
Weaveworks
 
Grails 4: Upgrade your Game!
Zachary Klein
 
MuleSoft Manchester Meetup #2 slides 29th October 2019
Ieva Navickaite
 
Ad

More from Tom Diederich (12)

PDF
Tom Diederich portfolio presentation (updated Nov. 18, 2016)
Tom Diederich
 
PDF
How to build & grow online communities: with Tom Diederich
Tom Diederich
 
PDF
How to build a production-ready in-memory-based application in 1 hour
Tom Diederich
 
PPTX
Ingesting streaming data for analysis in apache ignite (stream sets theme)
Tom Diederich
 
PDF
IT Modernization in Practice
Tom Diederich
 
PDF
In-Memory Key Value Store (KVS) in FPGA for Ultra Low Latency and High Throug...
Tom Diederich
 
PDF
Machine learning and deep learning with Apache Ignite
Tom Diederich
 
PPTX
Heimdall Data: "Increase Application Performance with SQL Auto-Caching; No Co...
Tom Diederich
 
PDF
Improving Apache Spark™ In-Memory Computing with Apache Ignite™
Tom Diederich
 
PDF
Comparing Apache Ignite and Cassandra for Hybrid Transactional/Analytical Pro...
Tom Diederich
 
PDF
“Building consistent and highly available distributed systems with Apache Ign...
Tom Diederich
 
PPTX
Quick MySQL performance check
Tom Diederich
 
Tom Diederich portfolio presentation (updated Nov. 18, 2016)
Tom Diederich
 
How to build & grow online communities: with Tom Diederich
Tom Diederich
 
How to build a production-ready in-memory-based application in 1 hour
Tom Diederich
 
Ingesting streaming data for analysis in apache ignite (stream sets theme)
Tom Diederich
 
IT Modernization in Practice
Tom Diederich
 
In-Memory Key Value Store (KVS) in FPGA for Ultra Low Latency and High Throug...
Tom Diederich
 
Machine learning and deep learning with Apache Ignite
Tom Diederich
 
Heimdall Data: "Increase Application Performance with SQL Auto-Caching; No Co...
Tom Diederich
 
Improving Apache Spark™ In-Memory Computing with Apache Ignite™
Tom Diederich
 
Comparing Apache Ignite and Cassandra for Hybrid Transactional/Analytical Pro...
Tom Diederich
 
“Building consistent and highly available distributed systems with Apache Ign...
Tom Diederich
 
Quick MySQL performance check
Tom Diederich
 
Ad

Recently uploaded (20)

PPTX
Tally_Basic_Operations_Presentation.pptx
AditiBansal54083
 
PDF
Automate Cybersecurity Tasks with Python
VICTOR MAESTRE RAMIREZ
 
PDF
Unlock Efficiency with Insurance Policy Administration Systems
Insurance Tech Services
 
PDF
MiniTool Partition Wizard Free Crack + Full Free Download 2025
bashirkhan333g
 
PPTX
Human Resources Information System (HRIS)
Amity University, Patna
 
PPTX
Agentic Automation Journey Session 1/5: Context Grounding and Autopilot for E...
klpathrudu
 
PDF
Odoo CRM vs Zoho CRM: Honest Comparison 2025
Odiware Technologies Private Limited
 
PDF
HiHelloHR – Simplify HR Operations for Modern Workplaces
HiHelloHR
 
PPTX
Why Businesses Are Switching to Open Source Alternatives to Crystal Reports.pptx
Varsha Nayak
 
PPTX
In From the Cold: Open Source as Part of Mainstream Software Asset Management
Shane Coughlan
 
PDF
MiniTool Partition Wizard 12.8 Crack License Key LATEST
hashhshs786
 
PDF
Open Chain Q2 Steering Committee Meeting - 2025-06-25
Shane Coughlan
 
PDF
Linux Certificate of Completion - LabEx Certificate
VICTOR MAESTRE RAMIREZ
 
PDF
Wondershare PDFelement Pro Crack for MacOS New Version Latest 2025
bashirkhan333g
 
PPTX
OpenChain @ OSS NA - In From the Cold: Open Source as Part of Mainstream Soft...
Shane Coughlan
 
PDF
Build It, Buy It, or Already Got It? Make Smarter Martech Decisions
bbedford2
 
PDF
Generic or Specific? Making sensible software design decisions
Bert Jan Schrijver
 
PDF
유니티에서 Burst Compiler+ThreadedJobs+SIMD 적용사례
Seongdae Kim
 
PDF
Thread In Android-Mastering Concurrency for Responsive Apps.pdf
Nabin Dhakal
 
PDF
The 5 Reasons for IT Maintenance - Arna Softech
Arna Softech
 
Tally_Basic_Operations_Presentation.pptx
AditiBansal54083
 
Automate Cybersecurity Tasks with Python
VICTOR MAESTRE RAMIREZ
 
Unlock Efficiency with Insurance Policy Administration Systems
Insurance Tech Services
 
MiniTool Partition Wizard Free Crack + Full Free Download 2025
bashirkhan333g
 
Human Resources Information System (HRIS)
Amity University, Patna
 
Agentic Automation Journey Session 1/5: Context Grounding and Autopilot for E...
klpathrudu
 
Odoo CRM vs Zoho CRM: Honest Comparison 2025
Odiware Technologies Private Limited
 
HiHelloHR – Simplify HR Operations for Modern Workplaces
HiHelloHR
 
Why Businesses Are Switching to Open Source Alternatives to Crystal Reports.pptx
Varsha Nayak
 
In From the Cold: Open Source as Part of Mainstream Software Asset Management
Shane Coughlan
 
MiniTool Partition Wizard 12.8 Crack License Key LATEST
hashhshs786
 
Open Chain Q2 Steering Committee Meeting - 2025-06-25
Shane Coughlan
 
Linux Certificate of Completion - LabEx Certificate
VICTOR MAESTRE RAMIREZ
 
Wondershare PDFelement Pro Crack for MacOS New Version Latest 2025
bashirkhan333g
 
OpenChain @ OSS NA - In From the Cold: Open Source as Part of Mainstream Soft...
Shane Coughlan
 
Build It, Buy It, or Already Got It? Make Smarter Martech Decisions
bbedford2
 
Generic or Specific? Making sensible software design decisions
Bert Jan Schrijver
 
유니티에서 Burst Compiler+ThreadedJobs+SIMD 적용사례
Seongdae Kim
 
Thread In Android-Mastering Concurrency for Responsive Apps.pdf
Nabin Dhakal
 
The 5 Reasons for IT Maintenance - Arna Softech
Arna Softech
 

Troubleshooting Apache® Ignite™

  • 1. Troubleshooting Apache® Ignite™ Customer Solutions, GridGain Stan Lukyanov
  • 2. 2019 © GridGain Systems. GridGain and Apache Ignite
      • GridGain In-Memory Computing Platform: In-Memory Data Grid; In-Memory Database; Streaming Analytics; Continuous Learning Framework; Segmentation Protection; Data Center Replication; Monitoring & Management; Enterprise Security; Rolling Upgrades; Point-in-Time Recovery; Heterogeneous Recovery; Full, Incremental, Continuous Backups; Network Backups
  • 3. Apache Ignite Support – Faster Time to Reliable Ignite
      • Get up and running faster with a 2-hour initial consultation
      • Ensure fast, reliable Ignite with unlimited 9x5 global support
          – Unlimited web/e-mail support
          – Identify bugs, workarounds
          – Troubleshoot performance, reliability issues
      • https://www.gridgain.com/products/services/support/support-apache-ignite
  • 4. Agenda
      • Monitoring
          – Logging
          – JMX
          – GridGain Web Console
      • Troubleshooting
          – Network
          – Storage
          – Performance
      • Checklist
  • 5. Monitoring
  • 6. Monitoring: Set it up before something bad happens!
  • 8. Logging: What you can do with logs
      • Manually check the state of the nodes
      • Identify issues with cluster configuration
      • Add automatic parsing to report issues on the fly
          – With custom or third-party tools
      • Provide them to GridGain Support experts
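As one illustration of the "automatic parsing" bullet, the sketch below counts WARN and ERROR entries per logger. It assumes the `[timestamp][LEVEL][thread][Logger] message` layout seen in the log excerpts later in this deck; the class name `LogLevelScanner` and the sample lines are hypothetical, and a production setup would more likely rely on an existing log-shipping tool.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: counts WARN and ERROR entries per logger. */
final class LogLevelScanner {
    // Assumed layout: [timestamp][LEVEL ][thread][Logger] message
    private static final Pattern ENTRY = Pattern.compile(
        "^\\[[^\\]]*\\]\\[(WARN|ERROR)\\s*\\]\\[[^\\]]*\\]\\[([^\\]]+)\\]");

    /** Returns a map such as "WARN H2TreeIndex" -> 1, in first-seen order. */
    static Map<String, Integer> countProblems(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = ENTRY.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1) + " " + m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample lines are illustrative, not taken from a real cluster.
        List<String> sample = Arrays.asList(
            "[2019-02-19T23:36:26,848][WARN ][main][H2TreeIndex] Indexed columns of a row cannot be fully inlined into index",
            "[2019-02-19T23:36:27,001][INFO ][main][IgniteKernal] Sample informational line",
            "[2019-02-19T23:36:28,120][WARN ][disco-event-worker-#42][GridDiscoveryManager] Node FAILED");
        countProblems(sample).forEach((key, count) -> System.out.println(key + " -> " + count));
    }
}
```

Grouping by logger name keeps the report short enough to alert on; lines that do not match the assumed layout are simply skipped.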
  • 9. Logging: Configuring GridGain logs
      • https://apacheignite.readme.io/docs/logging
      • Use log4j 2.x integration
          – Other options: j.u.logging, log4j 1.x, slf4j, custom integration
      • Start Ignite in verbose mode
          – ignite.sh -v
          – java -DIGNITE_QUIET=false -cp ...
  • 10. Logging: Quiet log
  • 11. Logging: Configuring GC logs
      • https://apacheignite.readme.io/docs/jvm-and-system-tuning
      • Crucial for troubleshooting many issues
      • To configure:
          -XX:+PrintGCDetails
          -XX:+PrintGCTimeStamps
          -XX:+PrintGCDateStamps
          -XX:+UseGCLogFileRotation
          -XX:NumberOfGCLogFiles=10
          -XX:GCLogFileSize=100M
          -Xloggc:/path/to/gc/logs/log.txt
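A quick way to confirm that a node was actually started with the GC logging flags is to inspect the JVM's own input arguments. The sketch below is an assumption-laden example: the class name `GcLogFlagCheck` is hypothetical, and the expected flags are the JDK 8 style options from the slide (later JDKs moved to unified logging with `-Xlog:gc*`, so the list would need adjusting there).

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: reports which GC logging flags from the slide are absent. */
final class GcLogFlagCheck {
    // Flags that carry a value are matched by prefix.
    private static final List<String> EXPECTED = Arrays.asList(
        "-XX:+PrintGCDetails",
        "-XX:+PrintGCTimeStamps",
        "-XX:+PrintGCDateStamps",
        "-XX:+UseGCLogFileRotation",
        "-XX:NumberOfGCLogFiles=",
        "-XX:GCLogFileSize=",
        "-Xloggc:");

    /** Returns the expected flags (or flag prefixes) not found in the given JVM arguments. */
    static List<String> missingFlags(List<String> jvmArgs) {
        List<String> missing = new ArrayList<>();
        for (String expected : EXPECTED) {
            boolean present = jvmArgs.stream().anyMatch(arg -> arg.startsWith(expected));
            if (!present) {
                missing.add(expected);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Checks the JVM this program runs in; run it with the same options as the node.
        List<String> actual = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Missing GC logging flags: " + missingFlags(actual));
    }
}
```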
  • 12. Logging: How to manage logs
      • Choose a location for the log files
          – The default location is ${IGNITE_HOME}/work/log/ignite.log
          – Use a local disk with enough space (2 GB+)
          – Don’t use /tmp!
          – It is a good idea to store GridGain, application and GC logs together
      • Archive old log files periodically to save on storage space
      • Try to keep logs for the cluster’s current uptime
  • 13. Logging: Run-time configuration changes
      • Helpful when you need to debug a running deployment
      • Easy to do with log4j 2.x
          – Edit the log4j config file directly
              • https://logging.apache.org/log4j/2.x/manual/configuration.html#AutomaticReconfiguration
          – Use JMX: log4j has its own bean
              • https://logging.apache.org/log4j/2.x/manual/jmx.html
  • 15. JMX: What you can do with JMX
      • Check the state of various grid subsystems: storage, thread pools, etc.
      • Monitor metrics changing over time
      • Add automatic warnings on important metrics
          – With custom or third-party tools
  • 16. JMX: How to leverage JMX beans
      • Standard way to monitor Java applications
          – A lot of tools on the market
          – JConsole and VisualVM are bundled with the JDK
  • 17. GridGain Metrics (name: description, default state, JMX query)
      • Cluster metrics: basic node information (enabled by default)
          – org.apache:clsLdr=*,group=Kernal,name=ClusterMetricsMXBeanImpl
          – org.apache:clsLdr=*,group=Kernal,name=ClusterLocalNodeMetricsMXBeanImpl
      • Data region metrics: memory information (disabled by default)
          – org.apache:clsLdr=*,group=DataRegionMetrics,name=<region_name>
      • Data storage metrics: persistent storage information (disabled by default)
          – org.apache:clsLdr=*,group="Persistent Store",name=DataStorageMetrics
      • Cache metrics: cache statistics (disabled by default)
          – org.apache:clsLdr=*,group=<cache_name>,name="org.apache.ignite.internal.processors.cache.CacheClusterMetricsMXBeanImpl"
          – org.apache:clsLdr=*,group=<cache_name>,name="org.apache.ignite.internal.processors.cache.CacheLocalMetricsMXBeanImpl"
      • Thread pool metrics: thread pools information (enabled by default)
          – org.apache:clsLdr=*,group="Thread Pools",name=<thread_pool_name>
      • https://apacheignite.readme.io/docs/cluster-groups#section-cluster-group-metrics
      • https://apacheignite.readme.io/docs/memory-metrics
      • https://apacheignite.readme.io/docs/cache-metrics
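The JMX queries in the metrics table are `ObjectName` patterns, so they can be passed to any MBean server directly. The following sketch uses only the JDK's `javax.management` API and queries the platform MBean server of the current JVM; the class name `IgniteJmxQuery` is hypothetical, the `name=*` wildcard stands in for `<thread_pool_name>`, and the result is empty unless an Ignite node is running in the same process. A remote node would need a JMX connector instead.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

/** Hypothetical helper: finds thread pool beans using the pattern from the metrics table. */
final class IgniteJmxQuery {
    /** Pattern from the table, with a wildcard in place of the thread pool name. */
    static ObjectName threadPoolPattern() throws MalformedObjectNameException {
        return new ObjectName("org.apache:clsLdr=*,group=\"Thread Pools\",name=*");
    }

    /** Returns the names of all beans on the given server that match the pattern. */
    static Set<ObjectName> findThreadPoolBeans(MBeanServer server) throws MalformedObjectNameException {
        return server.queryNames(threadPoolPattern(), null);
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Set<ObjectName> names = findThreadPoolBeans(server);
        System.out.println("Matching beans: " + names.size());
        for (ObjectName name : names) {
            System.out.println(name);
        }
    }
}
```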
  • 19. GridGain Tools
      • Web Console
          – Basic version in Ignite
          – Enhanced version in GridGain
          – All-in-one solution
          – Web-based
      • Visor GUI
          – The richest functionality
          – Available in GridGain
          – All-in-one solution
          – Runs on a user’s PC
      • Command line tools
          – Available in Ignite
          – Multiple scripts
              • Visor CMD
              • control.sh
              • SQLLine
  • 20. GridGain Web Console: What can be done
      • Manage multiple GridGain clusters through a web interface
          – Monitor metrics changing over time
          – Watch logs and thread dumps of the nodes
          – Execute and monitor running queries
      • Generate cluster configurations and RDBMS integrations
      • Available online to try at https://console.gridgain.com/
      • On-premises installation for production using Docker or bare metal
          – https://apacheignite-tools.readme.io/docs/docker-deployment
  • 21-24. GridGain Web Console (four slides with no text content)
  • 25. Troubleshooting
  • 27. Network troubleshooting
      • Nodes not joining the cluster
      • Nodes joining a wrong cluster
      • Nodes taking a long time to connect
      • Node is kicked out of the cluster
  • 28. Nodes not joining the cluster: Symptom
      • Started nodes don’t join a single cluster; each forms a cluster of its own instead
      • Topology snapshot is “ver=1, <...> servers=1” on both nodes
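The symptom can be detected automatically by reading the topology version and server count from the log. The sketch below is a minimal example under stated assumptions: the `Topology snapshot [` prefix and the sample line are assumed (the slide only shows the `ver=` and `servers=` fields), and the class name `TopologySnapshot` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts ver and servers from a topology snapshot log line. */
final class TopologySnapshot {
    // Assumed prefix; other fields between ver and servers are skipped.
    private static final Pattern SNAPSHOT =
        Pattern.compile("Topology snapshot \\[ver=(\\d+),.*?servers=(\\d+)");

    /** Returns {version, servers}, or null when the line is not a topology snapshot. */
    static int[] parse(String line) {
        Matcher m = SNAPSHOT.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new int[] {Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2))};
    }

    /** True when the node reports itself as the only server in its cluster. */
    static boolean isAlone(String line) {
        int[] snapshot = parse(line);
        return snapshot != null && snapshot[1] == 1;
    }

    public static void main(String[] args) {
        // Illustrative line, not captured from a real node.
        String line = "Topology snapshot [ver=1, locNode=a1b2c3d4, servers=1, clients=0]";
        System.out.println("Node is alone in its cluster: " + isAlone(line));
    }
}
```

If every node that was expected to form one cluster reports `servers=1`, discovery is the first thing to check.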
  • 29. Nodes not joining the cluster: Possible cause and solution
      • Connection issues, firewall, etc.
          – Check via ping, check firewall settings
          – Make sure TCP ports are open: 47500-47509, 47100-47109, 11211, 10800
      • IP Finder configuration
          – Use TcpDiscoveryVmIpFinder when you know all hosts in advance
              • Make sure to list all IPs and ports
          – A lot of other options: Kubernetes, cloud, etc.
              • https://apacheignite.readme.io/docs/kubernetes-deployment
              • https://apacheignite.readme.io/docs/tcpip-discovery
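Beyond ping, the listed ports can be probed with a plain TCP connect from the machine that fails to join. This is a rough sketch using only the JDK: the class name `PortProbe`, the 500 ms timeout and the default host are assumptions, and a port only shows as open when a node is actually listening on it.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks whether the TCP ports from the slide accept connections. */
final class PortProbe {
    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts and unresolved hosts.
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        // Port ranges as listed on the slide.
        int[][] ranges = {{47500, 47509}, {47100, 47109}, {11211, 11211}, {10800, 10800}};
        for (int[] range : ranges) {
            for (int port = range[0]; port <= range[1]; port++) {
                String state = isReachable(host, port, 500) ? "open" : "closed or filtered";
                System.out.println(host + ":" + port + " " + state);
            }
        }
    }
}
```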
  • 30. Nodes joining a wrong cluster: Symptom
      • A node started in a seemingly clean environment connects to some unknown cluster
      • Topology snapshot is “ver=2, servers=2”
  • 31. Nodes joining a wrong cluster: Possible cause and solution
      • Multicast (TcpDiscoveryMulticastIpFinder) is used
          – Only use it in closed subnets, or for a “Hello, world!”
      • Wrong IPs in the configuration
          – Check and fix the IP finder configuration (see the previous slide)
  • 32. Nodes taking a long time to connect: Symptom
      • Cluster is taking a long time to start
          – Or the first node is taking a long time to start
      • After the nodes have joined the cluster, performance is good
  • 33. Nodes taking a long time to connect: Possible cause and solution
      • Too many addresses in the IP finder
          – Ignite will try to connect to each known IP
          – Only list the addresses and ports you use
      • IgniteConfiguration.failureDetectionTimeout is too high
          – Basic timeout for most network operations
          – Used for establishing a connection
          – Reduce the failureDetectionTimeout to scan through the IP list faster
  • 34. Node is kicked out of the cluster: Symptom
      • Cluster nodes report that one of them has failed
          [...][WARN ][disco-event-worker-#42][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=13169858-92db-48a3-bc4b-db369cab6457, addrs=[…], sockAddrs=[…], discPort=47501, order=2, intOrder=2, lastExchangeTime=1550680047403, loc=false, ver=2.7.2#20190206-sha1:5f8f5488, isClient=false]
      • “Local node segmented” messages on the failed node
          [...][WARN ][disco-event-worker-#42][GridDiscoveryManager] Local node SEGMENTED: TcpDiscoveryNode [id=13169858-92db-48a3-bc4b-db369cab6457, addrs=[…], sockAddrs=[…], discPort=47501, order=2, intOrder=2, lastExchangeTime=1550680109573, loc=true, ver=2.7.2#20190206-sha1:5f8f5488, isClient=false]
  • 35. Node is kicked out of the cluster: Possible cause and solution
      • Likely a GC pause
      • To confirm
          – Check for “Possible too long JVM pause” messages in the logs
          – Analyze GC logs around the time of the issue
              • https://gceasy.io/ can help
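To confirm the GC hypothesis from the node's own log, the pause messages can be extracted and compared against the timeout in use. The sketch below assumes the message carries its duration as `Possible too long JVM pause: <N> milliseconds` (the slide quotes only the first part), and the class name `JvmPauseFinder` and the 10000 ms threshold in `main` are illustrative assumptions; substitute the failureDetectionTimeout actually configured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds long JVM pauses reported in the node log. */
final class JvmPauseFinder {
    // Assumed message shape; only the leading text is quoted on the slide.
    private static final Pattern PAUSE =
        Pattern.compile("Possible too long JVM pause: (\\d+) milliseconds");

    /** Returns the reported pause durations that are at least thresholdMs long. */
    static List<Long> pausesAtLeast(List<String> lines, long thresholdMs) {
        List<Long> pauses = new ArrayList<>();
        for (String line : lines) {
            Matcher m = PAUSE.matcher(line);
            if (m.find()) {
                long pauseMs = Long.parseLong(m.group(1));
                if (pauseMs >= thresholdMs) {
                    pauses.add(pauseMs);
                }
            }
        }
        return pauses;
    }

    public static void main(String[] args) {
        // Illustrative lines, not captured from a real node.
        List<String> sample = Arrays.asList(
            "[...][WARN ][jvm-pause-detector-worker][] Possible too long JVM pause: 612 milliseconds.",
            "[...][WARN ][jvm-pause-detector-worker][] Possible too long JVM pause: 12450 milliseconds.");
        System.out.println("Pauses at or above the timeout: " + pausesAtLeast(sample, 10_000));
    }
}
```

A pause at or above the timeout around the time of the "Node FAILED" message makes GC the most likely cause.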
  • 36. Node is kicked out of the cluster: Possible cause and solution
      • To handle GC issues
          – Increase heap: -Xmx16g
          – Try G1 with a latency target: -XX:+UseG1GC -XX:MaxGCPauseMillis=200
          – Reduce heap pressure
              • Use smaller SQL queries
                  ▪ More about SQL later
              • Use IgniteCache.withKeepBinary() to avoid deserialization
                  ▪ https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/IgniteCache.html#withKeepBinary--
              • Use on-heap caching with CacheConfiguration.copyOnRead=false
                  ▪ https://apacheignite.readme.io/docs/memory-configuration#section-on-heap-caching
          – https://apacheignite.readme.io/docs/jvm-and-system-tuning
      • Remedy: increase failureDetectionTimeout to allow a longer inactivity period
  • 38. Storage troubleshooting
      • “Out of memory” errors
      • Lost data after node restart
      • Incorrect/partial data returned by SQL
      • Some nodes don’t store data
  • 39. Out of memory: Symptom
      • Three kinds of “out of memory” conditions
          – Java’s OutOfMemoryError
          – IgniteOutOfMemoryException
          – The OS killing Ignite’s JVM
  • 40. Out of memory: Possible cause and solution (OutOfMemoryError)
      • Heap is too small
          – Increase -Xmx
      • Large SQL queries are running
          – Use lazy queries
              • https://apacheignite-sql.readme.io/docs/performance-and-debugging#section-result-set-lazy-loading
          – Avoid many large queries running concurrently
          – Split the query to reduce the data set size
              • Before:
                  ▪ SELECT * FROM PERSON ORDER BY AGE
              • After:
                  ▪ SELECT * FROM PERSON WHERE AGE < 40 ORDER BY AGE
                  ▪ SELECT * FROM PERSON WHERE AGE >= 40 ORDER BY AGE
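The before/after split on the slide generalizes to any number of range boundaries. The sketch below only builds the SQL strings for the PERSON/AGE example; the class name `QuerySplitter` is hypothetical, boundaries are assumed to be sorted in ascending order, and rows with a NULL AGE would match none of the range predicates, so they would need a separate query.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits the slide's ORDER BY AGE query into disjoint AGE ranges. */
final class QuerySplitter {
    /**
     * Builds one query per range. Boundaries must be non-empty and ascending;
     * n boundaries produce n + 1 queries.
     */
    static List<String> splitByAge(int... boundaries) {
        if (boundaries.length == 0) {
            throw new IllegalArgumentException("at least one boundary is required");
        }
        List<String> queries = new ArrayList<>();
        for (int i = 0; i <= boundaries.length; i++) {
            StringBuilder where = new StringBuilder();
            if (i > 0) {
                where.append("AGE >= ").append(boundaries[i - 1]);
            }
            if (i > 0 && i < boundaries.length) {
                where.append(" AND ");
            }
            if (i < boundaries.length) {
                where.append("AGE < ").append(boundaries[i]);
            }
            queries.add("SELECT * FROM PERSON WHERE " + where + " ORDER BY AGE");
        }
        return queries;
    }

    public static void main(String[] args) {
        // Running the queries in list order keeps the overall ordering by AGE.
        for (String sql : splitByAge(20, 40)) {
            System.out.println(sql);
        }
    }
}
```

Because the ranges are disjoint and ascending, concatenating the results in order preserves the global sort while each individual result set stays smaller.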
  • 41. Out of memory: Possible cause and solution (IgniteOutOfMemoryException)
      • Data region is too small
          IgniteOutOfMemoryException: Out of memory in data region [name=default, initSize=256,0 MiB, maxSize=476,8 MiB, persistenceEnabled=false] Try the following:
            ^-- Increase maximum off-heap memory size (DataRegionConfiguration.maxSize)
            ^-- Enable Ignite persistence (DataRegionConfiguration.persistenceEnabled)
            ^-- Enable eviction or expiration policies
          – Increase DataRegionConfiguration.maxSize
              • https://apacheignite.readme.io/docs/capacity-planning
          – Use Native Persistence
              • https://apacheignite.readme.io/docs/distributed-persistent-store
          – Use DataRegionConfiguration.pageEvictionMode=RANDOM_2_LRU
              • https://apacheignite.readme.io/docs/evictions
          – Use an expiry policy
              • https://apacheignite.readme.io/docs/expiry-policies
  • 42. Out of memory: Possible cause and solution (OOM killer)
      • Total size of the processes is greater than RAM
          – Check for the message
              Out of memory: Kill process <PID> (java) score <SCORE> or sacrifice child
          – Make sure that the total size of the processes is within bounds
          – Disable overcommit to get more graceful, in-process errors
              • sysctl -w vm.overcommit_memory=2
              • https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
              • May affect other applications, use carefully
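When a JVM disappears without any Java exception, the kernel log can be scanned for the OOM killer message quoted on the slide. This is a minimal sketch: the class name `OomKillerLine` and the sample line are hypothetical, and the wording of the kernel message may differ between kernel versions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: recognizes the OOM killer message quoted on the slide. */
final class OomKillerLine {
    private static final Pattern KILL = Pattern.compile(
        "Out of memory: Kill process (\\d+) \\(([^)]+)\\) score (\\d+) or sacrifice child");

    /** Returns the killed PID if the line reports a killed java process, otherwise -1. */
    static long killedJavaPid(String line) {
        Matcher m = KILL.matcher(line);
        if (m.find() && "java".equals(m.group(2))) {
            return Long.parseLong(m.group(1));
        }
        return -1;
    }

    public static void main(String[] args) {
        // Illustrative line; PID and score are made up.
        String line = "kernel: Out of memory: Kill process 12345 (java) score 901 or sacrifice child";
        System.out.println("Killed java PID: " + killedJavaPid(line));
    }
}
```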
  • 43. Lost data after node restart: Symptom
      • Some or all data was lost after one or several node restarts
  • 44. Lost data after node restart: Possible cause and solution
      • In-memory cache with zero backups
          – Taking down a node will always lose data
      • In-memory cache with one or more backups
          – Don’t take down more nodes simultaneously than there are backups
          – Wait for rebalance after bringing a node back online (use Web Console for that)
  • 45. Incorrect/partial data returned by SQL: Symptom
      • An SQL query with JOIN returns less data than expected
          CREATE TABLE ORGANIZATION (ID INT PRIMARY KEY, NAME VARCHAR);
          CREATE TABLE PERSON (ID INT, ORGID INT, NAME VARCHAR, PRIMARY KEY (ID, ORGID));
          SELECT * FROM PERSON P JOIN ORGANIZATION O ON P.ORGID = O.ID;
  • 46. Incorrect/partial data returned by SQL: Possible cause and solution
      • Data isn’t collocated
          – https://apacheignite.readme.io/docs/affinity-collocation
          – Use an affinity key to keep related data together
              CREATE TABLE ORGANIZATION (ID INT PRIMARY KEY, NAME VARCHAR);
              CREATE TABLE PERSON (ID INT, ORGID INT, NAME VARCHAR, PRIMARY KEY (ID, ORGID)) WITH "AFFINITY_KEY=ORGID";
              SELECT * FROM PERSON P JOIN ORGANIZATION O ON P.ORGID = O.ID;
          – Use replicated tables: they’re collocated with all others
              CREATE TABLE ORGANIZATION (ID INT PRIMARY KEY, NAME VARCHAR) WITH "TEMPLATE=REPLICATED";
              CREATE TABLE PERSON (ID INT, ORGID INT, NAME VARCHAR, PRIMARY KEY (ID, ORGID));
              SELECT * FROM PERSON P JOIN ORGANIZATION O ON P.ORGID = O.ID;
          – Use distributedJoins=true to do JOINs without collocation (this is costly!)
  • 47. Some nodes don’t store data: Symptom
      • Native persistence is enabled
      • Data is only stored on one node or a subset of nodes
      • New nodes don’t store data
  • 48. Some nodes don’t store data: Possible cause and solution
      • Baseline topology doesn’t include some nodes
          – https://apacheignite.readme.io/docs/baseline-topology
          – Update the baseline after every long-term topology change (use Web Console for that)
  • 50. Performance troubleshooting
      • Throughput of persistent cache updates drops to zero
      • Low throughput of SQL access
  • 51. Throughput of cache updates drops to zero: Symptom
      • Native persistence is enabled
      • Write performance is good but there are periodic intervals of 0 ops/sec
  • 52. Throughput of cache updates drops to zero: Possible cause and solution
      • Updates in RAM happen faster than on disk, so the disk needs to catch up
          – Enable write throttling
              • DataStorageConfiguration.writeThrottlingEnabled=true
              • Sacrifice peak performance for stable latency and throughput
          – Increase checkpoint page buffer size
              • Allow more pending updates in RAM
              • https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
  • 53. Low throughput of SQL access: Symptom
      • SQL reads using indexed fields are slow
          – Even for simple queries like SELECT * FROM FOO WHERE ID = 123
      • Writes to an SQL-enabled cache are slow
          – Even via key-value API
  • 54. Low throughput of SQL access: Possible cause and solution
      • A lot of possible causes: performance tuning is hard!
          – https://apacheignite-sql.readme.io/docs/performance-and-debugging
      • A common issue: indexes not fitting into the maximum inline size
          – Check for “Indexed columns of a row cannot be fully inlined into index” warnings
          – Do what the warning says
              [2019-02-19T23:36:26,848][WARN ][main][H2TreeIndex] <CacheQueryExamplePersons> Indexed columns of a row cannot be fully inlined into index what may lead to slowdown due to additional data page reads, increase index inline size if needed (use INLINE_SIZE option for CREATE INDEX command, QuerySqlField.inlineSize for annotated classes, or QueryIndex.inlineSize for explicit QueryEntity configuration) [cacheName=PERSON, tableName=CacheQueryExamplePersons, idxName=PERSON_FIRSTNAME_IDX, idxCols=(FIRSTNAME, _KEY), idxType=SECONDARY, curSize=10, recommendedInlineSize=71]
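The inline size warning already names the index and the recommended value, so the two can be pulled out of the log mechanically. The sketch below parses the bracketed suffix of the warning shown on the slide; the class name `InlineSizeWarning` is hypothetical, and the field order is assumed to match that sample.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the index name and inline sizes from the warning. */
final class InlineSizeWarning {
    // Field order assumed from the sample warning on the slide.
    private static final Pattern FIELDS = Pattern.compile(
        "idxName=([^,\\]]+),.*?curSize=(\\d+), recommendedInlineSize=(\\d+)");

    /** Returns "INDEX: current -> recommended", or null when the line is not such a warning. */
    static String summarize(String line) {
        Matcher m = FIELDS.matcher(line);
        if (!m.find()) {
            return null;
        }
        return m.group(1) + ": " + m.group(2) + " -> " + m.group(3);
    }

    public static void main(String[] args) {
        // Bracketed suffix copied from the warning on the slide.
        String line = "[cacheName=PERSON, tableName=CacheQueryExamplePersons, "
            + "idxName=PERSON_FIRSTNAME_IDX, idxCols=(FIRSTNAME, _KEY), idxType=SECONDARY, "
            + "curSize=10, recommendedInlineSize=71]";
        System.out.println(summarize(line));
    }
}
```

The recommended value can then be applied through one of the options the warning itself lists, such as INLINE_SIZE on CREATE INDEX.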
  • 56. Checklist
  • 57. Checklist: Be prepared
      • Set up monitoring before something happens, not after!
          – Make sure logs are safely stored
          – GridGain Web Console is an easy and effective way to monitor
      • Properly configure the cluster
          – Implement the suggestions from this presentation before you encounter the issues they help with
      • Know the common problems and how to react
          – Logs often provide solutions
  • 58. Checklist: Avoiding common problems
      • Network-related issues are often caused by IP configuration
          – Configure the IP finder; TcpDiscoveryVmIpFinder is usually a good fit
      • GC pauses cause performance and stability issues
          – Use G1 GC, set a fitting heap size
          – Reduce GC pressure by using lazy SQL, withKeepBinary() and on-heap cache
      • Insufficient memory will result in a crash
          – Plan storage capacity in advance
          – Assess required heap size during testing
          – Account for other processes in the system
  • 59. Checklist: Avoiding common problems
      • Implement cluster administration processes
          – For in-memory: wait for rebalance when starting or stopping nodes
          – For persistence: update baseline for long-term topology changes
      • Write performance may be unstable when updates arrive faster than the disk can absorb them
          – Enable write throttling
      • Consider performance suggestions reported in the log
          – The index inline size suggestion is an example
  • 60. Questions?
  • 61. Apache Ignite Resources
      • Apache Ignite documentation
          – https://apacheignite.readme.io/docs/
      • Apache Ignite community resources
          – [email protected] – the mailing list
          – https://ignite.apache.org/community/resources.html – other resources and instructions
          – http://apache-ignite-users.70518.x6.nabble.com/ – forum and archive
          – https://stackoverflow.com/questions/tagged/ignite – StackOverflow questions
      • Contact me!
          – [email protected]
          – [email protected]
  • 62. GridGain Resources
      • White Papers
          – Visit https://www.gridgain.com/resources/papers
      • Webinars
          – Visit https://www.gridgain.com/resources/webinars
      • Videos
          – Visit https://www.gridgain.com/resources/videos
      • Free 30-Day Ultimate, Enterprise or Professional Edition Trial
          – Visit https://www.gridgain.com/resources/download