Troubleshooting Apache® Ignite™

Customer Solutions, GridGain
Stan Lukyanov

2019 © GridGain Systems
GridGain and Apache Ignite
[Diagram: GridGain In-Memory Computing Platform. Core components: In-Memory Data Grid, In-Memory Database, Streaming Analytics, Continuous Learning Framework. Additional features: Segmentation Protection, Data Center Replication, Monitoring & Management, Enterprise Security, Rolling Upgrades, Point-in-Time Recovery, Heterogeneous Recovery, Full/Incremental/Continuous Backups, Network Backups]

Apache Ignite Support – Faster Time to Reliable Ignite
• Get up and running faster with a 2-hour initial consultation
• Ensure fast, reliable Ignite with unlimited 9x5 global support
– Unlimited web/e-mail support
– Identify bugs and workarounds
– Troubleshoot performance and reliability issues
https://www.gridgain.com/products/services/support/support-apache-ignite

Agenda
• Monitoring
– Logging
– JMX
– GridGain Web Console
• Troubleshooting
– Network
– Storage
– Performance
• Checklist

Monitoring

Monitoring
Set it up before something bad happens!

Logging
What can you do with logs
• Manually check the state of the nodes
• Identify issues with cluster configuration
• Add automatic parsing to report issues on the fly
– With custom or third-party tools
• Provide them to GridGain Support experts
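
The "automatic parsing" idea can start very small. The following is a minimal sketch, not a GridGain tool: it assumes plain-text log lines and counts lines containing message fragments that are quoted elsewhere in this deck. The names PATTERNS and scan_log_lines are hypothetical.

```python
import re
from collections import Counter

# Message fragments quoted elsewhere in this deck; extend the list for your own alerts.
PATTERNS = {
    "node_failed": re.compile(r"Node FAILED"),
    "node_segmented": re.compile(r"Local node SEGMENTED"),
    "jvm_pause": re.compile(r"Possible too long JVM pause"),
    "ignite_oom": re.compile(r"IgniteOutOfMemoryException"),
}


def scan_log_lines(lines):
    """Count how many lines match each known problem pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Abbreviated sample lines; real input would come from the Ignite log file.
    sample = [
        "[...][WARN ][disco-event-worker-#42][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [...]",
        "a line that matches nothing",
    ]
    for name, count in scan_log_lines(sample).items():
        print(f"{name}: {count}")
```

A script like this could run periodically over new log lines and feed whatever alerting system is already in place.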

Logging
Configuring GridGain logs
• https://apacheignite.readme.io/docs/logging
• Use the log4j 2.x integration
– Other options: j.u.logging, log4j 1.x, slf4j, custom integration
• Start Ignite in verbose mode
– ignite.sh -v
– java -DIGNITE_QUIET=false -cp ...

Logging
Quiet log

Logging
Configuring GC logs
• https://apacheignite.readme.io/docs/jvm-and-system-tuning
• Crucial for troubleshooting many issues
• To configure, add these JVM options (Java 8 syntax; Java 9+ uses -Xlog:gc* instead)
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M
-Xloggc:/path/to/gc/logs/log.txt

Logging
How to manage logs
• Choose a location for the log files
– The default location is ${IGNITE_HOME}/work/log/ignite.log
– Use a local disk with enough space (2 GB+)
– Don’t use /tmp!
– It’s a good idea to store GridGain, application and GC logs together
• Archive old log files periodically to save on storage space
• Try to keep logs that cover the cluster’s entire current uptime

Logging
Run-time configuration changes
• Helpful when you need to debug a running deployment
• Easy to do with log4j 2.x
– Edit the log4j config file directly
• https://logging.apache.org/log4j/2.x/manual/configuration.html#AutomaticReconfiguration
– Use JMX – log4j has its own bean
• https://logging.apache.org/log4j/2.x/manual/jmx.html

JMX
What can you do with JMX
• Check the state of various grid subsystems: storage, thread pools, etc.
• Monitor metrics changing over time
• Add automatic warnings on important metrics
– With custom or third-party tools

JMX
How to leverage JMX beans
• Standard way to monitor Java applications
– Many tools are available on the market
– JConsole and VisualVM are bundled with the JDK

GridGain Metrics
Name | Description | JMX Query | Default
Cluster metrics | Basic node information | org.apache:clsLdr=*,group=Kernal,name=ClusterMetricsMXBeanImpl and org.apache:clsLdr=*,group=Kernal,name=ClusterLocalNodeMetricsMXBeanImpl | Enabled
Data region metrics | Memory information | org.apache:clsLdr=*,group=DataRegionMetrics,name=<region_name> | Disabled
Data storage metrics | Persistent storage information | org.apache:clsLdr=*,group="Persistent Store",name=DataStorageMetrics | Disabled
Cache metrics | Cache statistics | org.apache:clsLdr=*,group=<cache_name>,name="org.apache.ignite.internal.processors.cache.CacheClusterMetricsMXBeanImpl" and org.apache:clsLdr=*,group=<cache_name>,name="org.apache.ignite.internal.processors.cache.CacheLocalMetricsMXBeanImpl" | Disabled
Thread pool metrics | Thread pools information | org.apache:clsLdr=*,group="Thread Pools",name=<thread_pool_name> | Enabled
• https://apacheignite.readme.io/docs/cluster-groups#section-cluster-group-metrics
• https://apacheignite.readme.io/docs/memory-metrics
• https://apacheignite.readme.io/docs/cache-metrics

GridGain Tools
Web Console
• Basic version in Ignite
• Enhanced version in GridGain
• All-in-one solution
• Web-based
• The richest functionality
Visor GUI
• Available in GridGain
• All-in-one solution
• Runs on a user’s PC
Command line tools
• Available in Ignite
• Multiple scripts
– Visor CMD
– control.sh
– SQLLine

GridGain Web Console
What can be done
• Manage multiple GridGain clusters through a web interface
– Monitor metrics changing over time
– Watch logs and thread dumps of the nodes
– Execute and monitor running queries
• Generate cluster configurations and RDBMS integrations
• Available online to try at https://console.gridgain.com/
• On-premises installation for production, using Docker or bare metal
– https://apacheignite-tools.readme.io/docs/docker-deployment

Troubleshooting

Network troubleshooting
• Nodes not joining the cluster
• Nodes joining a wrong cluster
• Nodes taking a long time to connect
• Node is kicked out from the cluster

Nodes not joining the cluster
Symptom
• Started nodes don’t form a single cluster; each one forms a cluster of its own instead
• Topology snapshot is “ver=1, <...> servers=1” on both nodes
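
This symptom can be checked mechanically. The sketch below assumes only what the excerpt above shows: a "Topology snapshot" line that contains ver=N followed later by servers=N. The function names are hypothetical, and the sample line is abbreviated.

```python
import re

# Matches the two fields shown in the excerpt above; everything in between is skipped.
SNAPSHOT = re.compile(r"Topology snapshot.*?ver=(\d+).*?servers=(\d+)")


def last_topology(lines):
    """Return (version, server_count) from the last topology snapshot line, or None."""
    result = None
    for line in lines:
        match = SNAPSHOT.search(line)
        if match:
            result = (int(match.group(1)), int(match.group(2)))
    return result


def looks_isolated(lines, expected_servers):
    """True if the latest snapshot shows fewer servers than were started."""
    topology = last_topology(lines)
    return topology is not None and topology[1] < expected_servers


if __name__ == "__main__":
    log = ["Topology snapshot [ver=1, servers=1]"]  # abbreviated sample line
    print(looks_isolated(log, expected_servers=2))
```

Running the check against each node's log after startup flags nodes that stayed alone in their own cluster.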

Nodes not joining the cluster
Possible cause and solution
• Connection issues, firewall, etc.
– Check via ping, check firewall settings
– Make sure TCP ports are open: 47500-47509 (discovery), 47100-47109 (communication), 11211 (REST/connector), 10800 (thin clients, JDBC/ODBC)
• IP Finder configuration
– Use TcpDiscoveryVmIpFinder when you know all hosts in advance
• Make sure to list all IPs and ports
– Many other options: Kubernetes, cloud, etc.
• https://apacheignite.readme.io/docs/kubernetes-deployment
• https://apacheignite.readme.io/docs/tcpip-discovery
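
Beyond ping, the port check can be scripted. This is a minimal sketch using only the Python standard library; the port list is taken from the slide above, while the helper names and the placeholder host are assumptions for illustration.

```python
import socket

# Port ranges listed on the slide above.
IGNITE_PORTS = [*range(47500, 47510), *range(47100, 47110), 11211, 10800]


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def open_ports(host, ports=IGNITE_PORTS, timeout=1.0):
    """Return the subset of ports on host that accept TCP connections."""
    return [port for port in ports if is_port_open(host, port, timeout)]


if __name__ == "__main__":
    # "127.0.0.1" is a placeholder; use the address of a remote Ignite node.
    print(open_ports("127.0.0.1"))
```

A node normally listens on only one port from each range, so not every port is expected to answer. If nothing in 47500-47509 answers from another host while the node is running, a firewall or a wrong bind address is a likely suspect.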

Nodes joining a wrong cluster
Symptom
• A node started in a seemingly clean environment connects to some unknown cluster
• Topology snapshot is “ver=2, servers=2”

Nodes joining a wrong cluster
• Multicast (TcpDiscoveryMulticastIpFinder) is used
– Only use it in closed subnets, or for “Hello, world!” experiments
• Wrong IPs in the configuration
– Check and fix the IP finder configuration – see the previous slide

Nodes taking a long time to connect
Symptom
• Cluster is taking a long time to start
– Or the first node is taking a long time to start
• After the nodes have joined the cluster, performance is good

Nodes taking a long time to connect
• Too many addresses in the IP finder
– Ignite will try to connect to each known IP
– Only list the addresses and ports you use
• IgniteConfiguration.failureDetectionTimeout is too high
– Basic timeout for most network operations
– Used for establishing a connection
– Reduce the failureDetectionTimeout to scan through the IP list faster
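
To see why long address lists hurt, it helps to count how many socket addresses an IP finder list expands to. The sketch below handles the "host", "host:port" and "host:first..last" forms. The worst-case estimate is a deliberately rough model and an assumption of this example: every address is unreachable, addresses are tried one after another, and each costs one full timeout. Actual behavior depends on the Ignite version and settings.

```python
def expand_address(entry):
    """Expand 'host', 'host:port' or 'host:first..last' into (host, port) pairs.

    A bare host gets port None here, meaning "whatever default the node uses".
    """
    if ":" not in entry:
        return [(entry, None)]
    host, ports = entry.rsplit(":", 1)
    if ".." in ports:
        first, last = ports.split("..")
        return [(host, port) for port in range(int(first), int(last) + 1)]
    return [(host, int(ports))]


def worst_case_scan_seconds(entries, timeout_ms):
    """Rough upper bound: every address is unreachable and is tried one after another."""
    total = sum(len(expand_address(entry)) for entry in entries)
    return total * timeout_ms / 1000.0


if __name__ == "__main__":
    # Hypothetical hosts; 10000 ms and 2000 ms are example timeout values.
    wide = ["10.0.0.1:47500..47509", "10.0.0.2:47500..47509"]
    narrow = ["10.0.0.1:47500", "10.0.0.2:47500"]
    print(worst_case_scan_seconds(wide, 10000))
    print(worst_case_scan_seconds(narrow, 2000))
```

Under this model both remedies from the slide multiply: fewer addresses and a shorter timeout each shrink the bound.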

Node is kicked out from the cluster
Symptom
• Cluster nodes report that one of them has failed
[...][WARN ][disco-event-worker-#42][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=13169858-92db-48a3-bc4b-db369cab6457, addrs=[…], sockAddrs=[…], discPort=47501, order=2, intOrder=2, lastExchangeTime=1550680047403, loc=false, ver=2.7.2#20190206-sha1:5f8f5488, isClient=false]
• “Local node segmented” messages on the failed node
[...][WARN ][disco-event-worker-#42][GridDiscoveryManager] Local node SEGMENTED: TcpDiscoveryNode [id=13169858-92db-48a3-bc4b-db369cab6457, addrs=[…], sockAddrs=[…], discPort=47501, order=2, intOrder=2, lastExchangeTime=1550680109573, loc=true, ver=2.7.2#20190206-sha1:5f8f5488, isClient=false]

Node is kicked out from the cluster
• Likely cause: a GC pause
• To confirm
– Check for “Possible too long JVM pause” messages in the logs
– Analyze GC logs around the time of the issue
• https://gceasy.io/ can help
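
The confirmation step can be partly automated. The sketch below assumes the pause message continues as ": <N> milliseconds" after the quoted fragment; verify that against your own logs before relying on it. The function name is hypothetical, and the timeout passed in should be whatever failureDetectionTimeout the cluster actually uses.

```python
import re

# Assumed message shape: "Possible too long JVM pause: <N> milliseconds"
PAUSE = re.compile(r"Possible too long JVM pause: (\d+) milliseconds")


def long_pauses(lines, failure_detection_timeout_ms):
    """Return pause durations (ms) that reach or exceed the failure detection timeout."""
    pauses = []
    for line in lines:
        match = PAUSE.search(line)
        if match:
            pause_ms = int(match.group(1))
            if pause_ms >= failure_detection_timeout_ms:
                pauses.append(pause_ms)
    return pauses


if __name__ == "__main__":
    # Made-up sample lines; 10000 ms is an example timeout value.
    sample = [
        "[...][WARN ] Possible too long JVM pause: 12000 milliseconds.",
        "[...][WARN ] Possible too long JVM pause: 600 milliseconds.",
    ]
    print(long_pauses(sample, 10000))
```

A pause at or above the timeout, logged shortly before a "Node FAILED" message for the same node, is strong evidence that GC was the cause.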

Node is kicked out from the cluster
• To handle GC issues
– Increase heap: -Xmx16g
– Try G1 with a latency target: -XX:+UseG1GC -XX:MaxGCPauseMillis=200
– Reduce heap pressure
• Use smaller SQL queries
▪ More about SQL later
• Use IgniteCache.withKeepBinary() to avoid deserialization
▪ https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/IgniteCache.html#withKeepBinary--
• Use on-heap caching with CacheConfiguration.copyOnRead=false
▪ https://apacheignite.readme.io/docs/memory-configuration#section-on-heap-caching
– https://apacheignite.readme.io/docs/jvm-and-system-tuning
• Workaround: increase failureDetectionTimeout to tolerate a longer period of inactivity

Storage troubleshooting
• “Out of memory” errors
• Lost data after node restart
• Incorrect/partial data returned by SQL
• Some nodes don’t store data

Out of memory
Symptom
• Three kinds of “out of memory” conditions
– Java’s OutOfMemoryError
– Ignite’s IgniteOutOfMemoryException
– The OS killing Ignite’s JVM

Out of memory
Possible cause and solution: OutOfMemoryError
• Heap is too small
– Increase -Xmx
• Large SQL queries are running
– Use lazy queries
• https://apacheignite-sql.readme.io/docs/performance-and-debugging#section-result-set-lazy-loading
– Avoid running many large queries concurrently
– Split the query to reduce the result set size
• Before:
▪ SELECT * FROM PERSON ORDER BY AGE
• After:
▪ SELECT * FROM PERSON WHERE AGE < 40 ORDER BY AGE
▪ SELECT * FROM PERSON WHERE AGE >= 40 ORDER BY AGE
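
The split shown above generalizes to any number of ranges. This is a minimal sketch of a query generator; the function name is hypothetical, and the table, column and boundary values are interpolated directly, so it is only suitable for trusted, numeric input.

```python
def split_by_ranges(table, column, boundaries):
    """Build one range-restricted query per interval defined by sorted boundaries.

    boundaries=[40] yields two queries: column < 40 and column >= 40.
    """
    queries = []
    lower = None
    for upper in list(boundaries) + [None]:
        conditions = []
        if lower is not None:
            conditions.append(f"{column} >= {lower}")
        if upper is not None:
            conditions.append(f"{column} < {upper}")
        where = f" WHERE {' AND '.join(conditions)}" if conditions else ""
        queries.append(f"SELECT * FROM {table}{where} ORDER BY {column}")
        lower = upper
    return queries


if __name__ == "__main__":
    for query in split_by_ranges("PERSON", "AGE", [40]):
        print(query)
```

Because the ranges are disjoint and ordered, running the queries in sequence and concatenating the results preserves the overall ORDER BY. One caveat: rows where the column is NULL match none of the range predicates, so they need a separate query if they matter.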

Out of memory
Possible cause and solution: IgniteOutOfMemoryException
• Data region is too small
IgniteOutOfMemoryException: Out of memory in data region [name=default, initSize=256,0 MiB, maxSize=476,8 MiB, persistenceEnabled=false] Try the following:
^-- Increase maximum off-heap memory size (DataRegionConfiguration.maxSize)
^-- Enable Ignite persistence (DataRegionConfiguration.persistenceEnabled)
^-- Enable eviction or expiration policies
– Increase DataRegionConfiguration.maxSize
• https://apacheignite.readme.io/docs/capacity-planning
– Use Native Persistence
• https://apacheignite.readme.io/docs/distributed-persistent-store
– Use DataRegionConfiguration.pageEvictionMode=RANDOM_2_LRU
• https://apacheignite.readme.io/docs/evictions
– Use an expiry policy
• https://apacheignite.readme.io/docs/expiry-policies

Out of memory
Possible cause and solution: OOM killer
• Total size of the processes is greater than RAM
– Check the kernel log for the message
Out of memory: Kill process <PID> (java) score <SCORE> or sacrifice child
– Make sure the total memory used by all processes fits within RAM
– Disable overcommit to get more graceful, in-process errors instead
• sysctl -w vm.overcommit_memory=2
• https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
• May affect other applications, use carefully
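
Checking that the planned memory use stays within bounds is simple arithmetic. The sketch below is a simplified model, not a sizing tool: it adds up the JVM heap, the data region maximum sizes and an allowance for other processes. The reserve_fraction margin is a hypothetical parameter, and JVM overhead outside the heap (metaspace, thread stacks, buffers) has to be included in other_bytes by the caller.

```python
GIB = 1024 ** 3


def memory_budget(ram_bytes, heap_bytes, data_region_bytes, other_bytes=0, reserve_fraction=0.1):
    """Return (fits, headroom_bytes) for a node's planned memory use.

    reserve_fraction is a hypothetical safety margin kept free for the OS.
    """
    planned = heap_bytes + sum(data_region_bytes) + other_bytes
    usable = ram_bytes * (1 - reserve_fraction)
    headroom = int(usable - planned)
    return headroom >= 0, headroom


if __name__ == "__main__":
    # Example numbers only: 64 GiB host, -Xmx16g, one 40 GiB data region, 4 GiB for everything else.
    fits, headroom = memory_budget(64 * GIB, 16 * GIB, [40 * GIB], other_bytes=4 * GIB)
    print(fits, round(headroom / GIB, 1))
```

A negative headroom means the configuration can only work while the data regions are not full, which is exactly the situation in which the OOM killer strikes later.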

Lost data after node restart
Symptom
• Some or all data was lost after one or several node restarts

Lost data after node restart
• In-memory cache with zero backups
– Taking down a node always loses the data stored on it
• In-memory cache with one or more backups
– Don’t take down more nodes simultaneously than there are backups
– Wait for rebalancing to finish after bringing a node back online (use Web Console to check)

Incorrect/partial data returned by SQL
Symptom
• An SQL query with JOIN returns fewer rows than expected
CREATE TABLE ORGANIZATION (ID INT PRIMARY KEY, NAME VARCHAR);
CREATE TABLE PERSON (ID INT, ORGID INT, NAME VARCHAR, PRIMARY KEY (ID, ORGID));
SELECT * FROM PERSON P JOIN ORGANIZATION O ON P.ORGID = O.ID;

Incorrect/partial data returned by SQL
• Data isn’t collocated
– https://apacheignite.readme.io/docs/affinity-collocation
– Use an affinity key to keep related data together
CREATE TABLE ORGANIZATION (ID INT PRIMARY KEY, NAME VARCHAR);
CREATE TABLE PERSON (ID INT, ORGID INT, NAME VARCHAR, PRIMARY KEY (ID, ORGID))
WITH "AFFINITY_KEY=ORGID";
– Use replicated tables – they’re collocated with all others
CREATE TABLE ORGANIZATION (ID INT PRIMARY KEY, NAME VARCHAR)
WITH "TEMPLATE=REPLICATED";
CREATE TABLE PERSON (ID INT, ORGID INT, NAME VARCHAR, PRIMARY KEY (ID, ORGID));
– Use distributedJoins=true to do JOINs without collocation (this is costly!)
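
Why a non-collocated JOIN silently drops rows is easier to see in a toy model. The sketch below is not Ignite code: it places rows on two imaginary nodes with a simplistic "value modulo node count" affinity function, joins on each node separately and merges the results, which mirrors how a JOIN without distributedJoins only sees local data. All names and rows are made up.

```python
def place(rows, affinity, nodes):
    """Distribute rows across nodes by a toy affinity function: value modulo node count."""
    placement = [[] for _ in range(nodes)]
    for row in rows:
        placement[affinity(row) % nodes].append(row)
    return placement


def local_join(persons_by_node, orgs_by_node):
    """Join PERSON and ORGANIZATION on each node separately and merge the results."""
    joined = []
    for persons, orgs in zip(persons_by_node, orgs_by_node):
        org_names = {org["id"]: org["name"] for org in orgs}
        for person in persons:
            if person["orgid"] in org_names:
                joined.append((person["name"], org_names[person["orgid"]]))
    return sorted(joined)


ORGS = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]
PERSONS = [
    {"id": 1, "orgid": 1, "name": "Ann"},
    {"id": 2, "orgid": 1, "name": "Bob"},
    {"id": 3, "orgid": 2, "name": "Cid"},
    {"id": 4, "orgid": 2, "name": "Dee"},
]
NODES = 2

orgs_by_node = place(ORGS, lambda org: org["id"], NODES)
# Without an affinity key, PERSON rows land wherever their own ID maps to.
by_id = local_join(place(PERSONS, lambda p: p["id"], NODES), orgs_by_node)
# With AFFINITY_KEY=ORGID, PERSON rows follow their organization.
by_orgid = local_join(place(PERSONS, lambda p: p["orgid"], NODES), orgs_by_node)

print(len(by_id), "rows without collocation")   # prints "2 rows without collocation"
print(len(by_orgid), "rows with collocation")   # prints "4 rows with collocation"
```

In the first case Bob and Cid sit on a different node than their organization, so the local join never pairs them up. Placing persons by ORGID puts every person next to the matching organization row.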

Some nodes don’t store data
Symptom
• Native persistence is enabled
• Data is only stored on one node/subset of nodes
• New nodes don’t store data

Some nodes don’t store data
• Baseline topology doesn’t include some nodes
– https://apacheignite.readme.io/docs/baseline-topology
– Update the baseline after every long-term topology change (use Web Console for that)

Performance troubleshooting
• Throughput of persistent cache updates drops to zero
• Low throughput of SQL access

Throughput of cache updates drops to zero
Symptom
• Native persistence is enabled
• Write performance is good but there are periodic intervals of 0 ops/sec

Throughput of cache updates drops to zero
• Updates in RAM happen faster than on disk – disk needs to catch up
– Enable write throttling
• DataStorageConfiguration.writeThrottlingEnabled=true
• Sacrifice peak performance for stable latency and throughput
– Increase checkpoint page buffer size (DataRegionConfiguration.checkpointPageBufferSize)
• Allow more pending updates in RAM
• https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood

Low throughput of SQL access
Symptom
• SQL reads using indexed fields are slow
– Even for simple queries like SELECT * FROM FOO WHERE ID = 123
• Writes to an SQL-enabled cache are slow
– Even via key-value API

Low throughput of SQL access
• A lot of possible causes – performance tuning is hard!
– https://apacheignite-sql.readme.io/docs/performance-and-debugging
• A common issue – indexes not fitting into maximum inline size
– Check for “Indexed columns of a row cannot be fully inlined into index” warnings
– Do what the warning says
[2019-02-19T23:36:26,848][WARN ][main][H2TreeIndex] <CacheQueryExamplePersons> Indexed columns of a row cannot be fully inlined into index what may lead to slowdown due to additional data page reads, increase index inline size if needed (use INLINE_SIZE option for CREATE INDEX command, QuerySqlField.inlineSize for annotated classes, or QueryIndex.inlineSize for explicit QueryEntity configuration) [cacheName=PERSON, tableName=CacheQueryExamplePersons, idxName=PERSON_FIRSTNAME_IDX, idxCols=(FIRSTNAME, _KEY), idxType=SECONDARY, curSize=10, recommendedInlineSize=71]
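
Since the warning already names the index and the recommended size, collecting these hints from a log is straightforward. The sketch below relies only on the key=value fields visible in the warning above; the function name is hypothetical, and it expects each warning as a single line.

```python
import re

# Fields taken from the bracketed part of the warning shown above.
FIELDS = re.compile(r"(idxName|tableName|curSize|recommendedInlineSize)=(\w+)")


def inline_size_hint(warning):
    """Extract index name, table name, current and recommended inline size from the warning."""
    found = dict(FIELDS.findall(warning))
    if "curSize" not in found or "recommendedInlineSize" not in found:
        return None
    return {
        "index": found.get("idxName"),
        "table": found.get("tableName"),
        "current": int(found["curSize"]),
        "recommended": int(found["recommendedInlineSize"]),
    }


if __name__ == "__main__":
    # Abbreviated version of the warning shown above.
    line = (
        "Indexed columns of a row cannot be fully inlined into index "
        "[tableName=CacheQueryExamplePersons, idxName=PERSON_FIRSTNAME_IDX, "
        "curSize=10, recommendedInlineSize=71]"
    )
    print(inline_size_hint(line))
```

The extracted values map directly onto the options the warning itself lists: INLINE_SIZE for CREATE INDEX, QuerySqlField.inlineSize, or QueryIndex.inlineSize.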

Checklist

Checklist
Be prepared
• Set up monitoring before something happens, not after!
– Make sure logs are safely stored
– GridGain Web Console is an easy and effective way to monitor
• Properly configure the cluster
– Implement the suggestions from this presentation before you encounter the issues they help with
• Know the common problems and how to react
– Logs often provide solutions

Checklist
Avoiding common problems
• Network-related issues are often caused by IP configuration
– Configure IP finder – TcpDiscoveryVmIpFinder is usually a good fit
• GC pauses cause performance and stability issues
– Use G1 GC, set a fitting heap size
– Reduce GC pressure by using lazy SQL, withKeepBinary() and on-heap cache
• Insufficient memory will result in a crash
– Plan storage capacity in advance
– Assess required heap size during testing
– Account for other processes in the system

Checklist
Avoiding common problems
• Implement cluster administration processes
– For in-memory: wait for rebalance when starting or stopping nodes
– For persistence: update baseline for long-term topology changes
• Write performance may be unstable when updates arrive faster than the disk can absorb them
– Enable write throttling
• Consider performance suggestions reported in the log
– Index inline size suggestion is an example

Questions?

Apache Ignite Resources
• Apache Ignite documentation
– https://apacheignite.readme.io/docs/
• Apache Ignite community resources
– user@ignite.apache.org – the mailing list
– https://ignite.apache.org/community/resources.html – other resources and instructions
– http://apache-ignite-users.70518.x6.nabble.com/ – forum and archive
– https://stackoverflow.com/questions/tagged/ignite – StackOverflow questions
• Contact me!
– slukyanov@gridgain.com
– stanlukyanov@gmail.com

GridGain Resources
• White Papers
– Visit https://www.gridgain.com/resources/papers
• Webinars
– Visit https://www.gridgain.com/resources/webinars
• Videos
– Visit https://www.gridgain.com/resources/videos
• Free 30-Day Ultimate, Enterprise or Professional Edition Trial
– Visit https://www.gridgain.com/resources/download
