Filtering 100M objects in Coherence cache

What can go wrong?

Alexey Ragozin
alexey.ragozin@gmail.com
Dec 2013
Problem description
• 100M objects (50M was tested)
• ~100 fields per object
• ~1 KB per object (ProtoBuf binary format)
• Simple queries: select … where … order by … [limit N]
• Expected query result set – 200k
• Max query result set – 50% of all data

Assessment
• Object count and size are OK
• Simple queries are in line with Coherence filters
• Expected result set of 200k: challenge
• Max result set of 50% of all data: real challenge
Big result set problem
Calling NamedCache method
 • Single TCMP message to each participating member
 • Processing on remote member
 • Single TCMP result message from each member
 • Aggregation of all results in caller JVM
Return from method

Huge TCMP message = cluster crash
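A back-of-the-envelope estimate illustrates why one result message per member becomes dangerous. This is only a sketch: it assumes matching objects are returned whole and are spread evenly over 80 nodes (the node count quoted later in the deck), and `per_member_bytes` is a hypothetical helper, not a Coherence API.

```python
OBJECT_SIZE = 1024  # ~1 KB per object, from the problem description
NODES = 80          # node count quoted later in the deck; even distribution is an assumption


def per_member_bytes(result_objects, nodes=NODES, object_size=OBJECT_SIZE):
    """Approximate size of the single result message one member would send."""
    return result_objects * object_size // nodes


expected = per_member_bytes(200_000)     # expected result set of 200k objects
worst = per_member_bytes(50_000_000)     # 50% of 100M objects

print(f"expected result set: {expected / 2**20:.1f} MiB per member")
print(f"max result set: {worst / 2**20:.1f} MiB per member")
```

Under these assumptions the expected case stays in the low megabytes per member, while the maximum case reaches hundreds of megabytes in a single message.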
Naive strategy
Processing of query
 • Send aggregator with filter, retrieve
   • Keys
   • Field for sorting
 • Sort whole result set (keys + few fields)
 • Apply limit
 • Retrieve and send objects in fixed batches
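The steps above can be sketched in plain Python. This is a toy model, not Coherence code: each storage node is simulated by a dict, and `naive_query`, `predicate`, and `batch_size` are hypothetical names chosen for illustration.

```python
def naive_query(nodes, predicate, sort_field, limit=None, batch_size=2):
    """nodes: list of dicts {key: object}, each dict standing in for one storage node."""
    # Step 1: every node returns (sort value, key) for each matching entry.
    refs = [
        (obj[sort_field], key)
        for node in nodes
        for key, obj in node.items()
        if predicate(obj)
    ]
    # Step 2: sort the whole result set in the caller, then apply the limit.
    refs.sort()
    if limit is not None:
        refs = refs[:limit]
    # Step 3: fetch the full objects in fixed-size batches.
    owner = {key: node for node in nodes for key in node}
    for start in range(0, len(refs), batch_size):
        yield [owner[key][key] for _, key in refs[start:start + batch_size]]


nodes = [
    {"k1": {"price": 30, "status": "OPEN"}, "k2": {"price": 10, "status": "OPEN"}},
    {"k3": {"price": 20, "status": "CLOSED"}, "k4": {"price": 5, "status": "OPEN"}},
]
for batch in naive_query(nodes, lambda o: o["status"] == "OPEN", "price"):
    print(batch)
```

Note that step 2 holds every matching reference in the caller at once, which is where the memory pressure on the service node comes from.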
Testing …
OutOfMemoryError on storage node
 • Storage node processing filter (600K objects per node)
   • Deserialize value, apply filter, match …
   • … retain entry, with the deserialized value still attached, until … (end of filtering?)
 • Filter processing may take a few seconds
 • There could be a few concurrent queries
Solving …
Using indexes
 • Index-only filter does not deserialize the object
 • We cannot index everything
 • A single unindexed predicate triggers deserialization

Special filter to cut deserialized object reference
 • We do not need the object (aggregator extracts from binary)
 • Deserialized object is now collected in young space
 • Synthetic wrapper object + messing with serialization
Testing …
Very high memory usage on service node
Collecting and sorting large result set
 • Have to use huge young space (8 GiB)
 • Query concurrency is limited by memory
Single threaded sorting
 • It is very fast though
Indexes and attribute cardinality
“Status” attribute – 90% of objects are OPEN
[Bar chart, axis 0–7, comparing predicate orders “ticker & side” and “side & ticker”:
 No index: 5.69, 5.78
 Indexed: ticker: 0.066, 0.066
 Indexed: both: 0.01, 0.496
 Indexed: both + noIndex hint: 0.018, 0.017]
http://blog.ragozin.info/2013/07/coherence-101-filters-performance-and.html
Indexes and attribute cardinality
Possible strategies to remedy
 • Transform query
 • Wrap “bad” predicates into NoIndexFilter
 • Fix filter execution “planner”
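One way to picture a better "planner" is to use only the most selective index and check the remaining predicates against the surviving candidates, which has the same effect as keeping a low-cardinality predicate away from its index. The sketch below is a plain-Python model with hypothetical names (`build_index`, `query`); it is not how Coherence evaluates filters internally.

```python
def build_index(objects, field):
    """Inverted index: attribute value -> set of keys."""
    index = {}
    for key, obj in objects.items():
        index.setdefault(obj[field], set()).add(key)
    return index


def query(objects, indexes, predicates):
    """predicates: {field: required value}, combined with AND."""
    indexed = [f for f in predicates if f in indexes]
    if indexed:
        # Pick the index whose posting set for the requested value is smallest.
        first = min(indexed, key=lambda f: len(indexes[f].get(predicates[f], ())))
        candidates = set(indexes[first].get(predicates[first], ()))
        rest = [f for f in predicates if f != first]
    else:
        # No usable index: fall back to a full scan.
        candidates = set(objects)
        rest = list(predicates)
    # Remaining predicates are checked per candidate, without touching their indexes.
    return {k for k in candidates if all(objects[k][f] == predicates[f] for f in rest)}


objects = {
    1: {"status": "OPEN", "ticker": "AAA"},
    2: {"status": "OPEN", "ticker": "BBB"},
    3: {"status": "OPEN", "ticker": "AAA"},
    4: {"status": "CLOSED", "ticker": "AAA"},
}
indexes = {f: build_index(objects, f) for f in ("status", "ticker")}
print(query(objects, indexes, {"status": "OPEN", "ticker": "BBB"}))
```

With most objects in status OPEN, the "status" posting set is large, so the sketch starts from the "ticker" index and never materialises the big set.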
Indexes and attribute cardinality
Can we go without indexes?
 • Full scan of 50M objects – 80 cores, 3 servers
 • 30 seconds
 • Too slow!
Naive strategy
Almost good enough
Problems with naive strategy
 • Big memory problems on service process
 • Max result set size is limited
 • No control over max TCMP packet size
 • Indexes may be inefficient
Incremental retrieval
 • Result set is always sorted
 • Primary key is always the last sort attribute
 • Aggregator on invocation
   • Sorts its partial result set
   • Selects first N
   • Returns N references (key + sort attribute)
   • Returns remaining size of each partition
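A minimal simulation of this scheme is sketched below. `Partition.first_n_after` plays the role of the aggregator (sorted first N references after a cursor, plus the remaining size), and `incremental_query` is a hypothetical caller-side merge. The refill policy, which asks a partition again only once its buffered references are consumed, is an assumption of this sketch and not taken from the slides.

```python
from bisect import bisect_right


class Partition:
    """Stand-in for one partition holding (sort value, primary key) references."""

    def __init__(self, refs):
        # The primary key is the last sort attribute, so references are unique.
        self.refs = sorted(refs)

    def first_n_after(self, cursor, n):
        """Up to n references strictly after cursor, plus the count remaining beyond them."""
        start = 0 if cursor is None else bisect_right(self.refs, cursor)
        batch = self.refs[start:start + n]
        return batch, len(self.refs) - start - len(batch)


def incremental_query(partitions, batch_size, limit=None):
    """Yield references in global sort order, at most batch_size per partition per call."""
    if limit is not None and limit <= 0:
        return
    buffers = [[] for _ in partitions]
    cursors = [None] * len(partitions)
    exhausted = [False] * len(partitions)
    emitted = 0
    while True:
        # Refill only partitions whose buffer is drained and which still have data.
        for i, part in enumerate(partitions):
            if not buffers[i] and not exhausted[i]:
                batch, remaining = part.first_n_after(cursors[i], batch_size)
                buffers[i] = batch
                if batch:
                    cursors[i] = batch[-1]
                exhausted[i] = remaining == 0
        if not any(buffers):
            return
        # Emit while every partition that may still hold data has a buffered head.
        while any(buffers):
            if any(not buf and not exhausted[i] for i, buf in enumerate(buffers)):
                break
            i = min((j for j, buf in enumerate(buffers) if buf), key=lambda j: buffers[j][0])
            yield buffers[i].pop(0)
            emitted += 1
            if limit is not None and emitted >= limit:
                return


parts = [
    Partition([(5, "a"), (1, "b"), (9, "c")]),
    Partition([(2, "d"), (3, "e")]),
    Partition([]),
    Partition([(4, "f"), (6, "g"), (7, "h"), (8, "i")]),
]
print(list(incremental_query(parts, batch_size=2, limit=5)))
```

The caller never holds more than `batch_size` references per partition, which keeps both message size and caller memory bounded regardless of the total result size.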
Incremental retrieval
[Diagram, repeated over several animation steps: partitions P1–P4 laid out along the sort order; in one step a partition is marked “Partition excluded”.]
Incremental retrieval
Advantages
 • Size of TCMP packets is under control
 • Reduced traffic for LIMIT queries
 • Fixed memory requirements for service node

Partial retrieval limit
 • Target result set – 200k
 • 80 nodes
 • Best performance with ~1500 limit
Incremental retrieval
A little nuance …
 • Filter based aggregator is executed by one thread
 • How many times would the aggregate(…) method be called?
   • Once
   • Twice
   • Once per partition
   • Other

Coherence limits the amount of data passed to aggregate(…) based on the binary size of the data.
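Since the input may arrive in several chunks, a "first N" aggregator has to produce partial results that can be merged afterwards. The sketch below shows that idea in plain Python; `aggregate` and `combine` are hypothetical functions that only mirror the concept and do not reproduce the Coherence aggregator interfaces.

```python
import heapq


def aggregate(entries, n):
    """Partial result for one chunk: first n references in sort order plus the count of the rest."""
    top = heapq.nsmallest(n, entries)
    return top, len(entries) - len(top)


def combine(partials, n):
    """Merge partial results from several aggregate(...) calls into one."""
    merged = heapq.nsmallest(n, (ref for top, _ in partials for ref in top))
    total = sum(len(top) + rest for top, rest in partials)
    return merged, total - len(merged)


chunks = [
    [(7, "k7"), (1, "k1"), (4, "k4")],
    [(3, "k3"), (9, "k9")],
    [(2, "k2"), (8, "k8"), (5, "k5"), (6, "k6")],
]
print(combine([aggregate(chunk, 3) for chunk in chunks], 3))
```

Merging the per-chunk results gives the same answer as a single call over all entries, so the aggregator stays correct however the input is split.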
Incremental retrieval
What about snapshot consistency?
 • There was no consistency to begin with
 • No consistency between nodes
 • Index updates are not transactional

But we need the result set of a query to be consistent!
 • Hand made MVCC
 • If you REALLY, REALLY, REALLY need it
Hand made MVCC
 • Synthetic key to have multiple versions in cache
 • Data affinity to exploit partition level consistency
 • Timestamp based surface – consistent snapshot

If timestamp is a part of the key
 • IndexAwareFilter can be used (without an index)
Otherwise
 • TimeSeriesIndex - https://github.com/gridkit/coherence-search-timeseries
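The timestamp based surface can be illustrated with a small model: every version is stored under a synthetic key of (series key, timestamp), and a snapshot picks, for each series, the newest version not later than the snapshot time. The `snapshot` function and the dict-based cache are assumptions made for this sketch, which uses a full scan instead of a filter or an index.

```python
def snapshot(cache, as_of):
    """cache: {(series_key, timestamp): payload}. Return {series_key: payload} visible at as_of."""
    latest = {}
    for (series_key, timestamp), payload in cache.items():
        if timestamp > as_of:
            continue  # version written after the snapshot point
        if series_key not in latest or timestamp > latest[series_key][0]:
            latest[series_key] = (timestamp, payload)
    return {series_key: payload for series_key, (_, payload) in latest.items()}


cache = {
    ("A", 1): "a1",
    ("A", 5): "a5",
    ("B", 3): "b3",
    ("C", 7): "c7",
}
print(snapshot(cache, as_of=4))
```

Later writes only add new (series key, timestamp) entries, so a reader holding a fixed `as_of` keeps seeing the same versions.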
Time series index
Special index for managing versioned data
[Diagram: a cache entry (entry key, entry value) and its components: entry id, series key, timestamp, payload.]

Getting last version for series k
select * from versions where series=k and version =
  (select max(version) from versions where series=k)
https://github.com/gridkit/coherence-search-timeseries
Time series index
[Diagram: the series inverted index (HASHTABLE) maps each series key to a timestamp inverted subindex (ORDERED), which maps each timestamp to an entry ref.]
https://github.com/gridkit/coherence-search-timeseries
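The two-level layout (hash table by series key, ordered structure by timestamp) can be sketched as follows. `SeriesVersionIndex` is a hypothetical class written for illustration; it is not the API of the coherence-search-timeseries project.

```python
from bisect import bisect_right


class SeriesVersionIndex:
    """Hash table of series key -> timestamp-ordered entry references."""

    def __init__(self):
        self._series = {}  # series key -> (sorted timestamps, entry refs in the same order)

    def add(self, series_key, timestamp, entry_ref):
        stamps, refs = self._series.setdefault(series_key, ([], []))
        pos = bisect_right(stamps, timestamp)
        stamps.insert(pos, timestamp)
        refs.insert(pos, entry_ref)

    def latest(self, series_key, as_of=None):
        """Entry ref of the newest version, optionally not later than as_of; None if absent."""
        stamps, refs = self._series.get(series_key, ([], []))
        pos = len(stamps) if as_of is None else bisect_right(stamps, as_of)
        return refs[pos - 1] if pos else None


index = SeriesVersionIndex()
index.add("k", 20, "entry-2")
index.add("k", 10, "entry-1")
index.add("k", 30, "entry-3")
print(index.latest("k"), index.latest("k", as_of=25))
```

The "last version for series k" lookup then costs one hash lookup plus one binary search, instead of the nested max(version) query.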
TCMP vs TCP
TCP
• WAN networks
• Slow start
• Sliding window
• Timeout packet loss detection
Fair network sharing

TCMP
• Single switch networks
• Fast NACKs
• Loss detection by packet order
• Per packet resend
Low latency communications
Bandwidth maximization
TCMP vs TCP
In bandwidth competition TCP doesn’t have a chance against TCMP
Having TCP and TCMP in one network
 • Normally TCMP is limited by the proxy speaking TCP
 • Traffic amplification effects (TCMP traffic >> TCP traffic)
 • Bandwidth strangled TCP becomes unstable
   • Hangs for a few seconds (retransmit timeouts)
   • Spurious connection resets

Keep TCMP on a separate switch if possible!
http://blog.ragozin.info/2013/09/coherence-101-entryprocessor-traffic.html
Bonus: ProtoBuf extractor
Inspired by POF extractor
 • Extracts fields from binary data
 • Does not require generated classes or IDL
 • Uses field IDs to navigate data
 • XPath like expressiveness (e.g. extract from map by key)
 • Can process any number of extractors in a single pass
 • Apache 2.0 licensed
https://github.com/gridkit/binary-extractors
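To show what navigating by field ID without generated classes means, here is a minimal single-pass reader for the Protocol Buffers wire format. It is an independent sketch, not the binary-extractors implementation; `read_varint` and `extract_fields` are hypothetical names, nested paths are not handled, and values are returned raw (integers for varints, bytes otherwise).

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def extract_fields(buf, wanted):
    """Single pass over buf; return {field_id: [raw values]} for the ids in wanted."""
    found = {field_id: [] for field_id in wanted}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_id, wire_type = tag >> 3, tag & 7
        if wire_type == 0:      # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:    # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:    # length-delimited (string, bytes, nested message)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:    # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if field_id in found:
            found[field_id].append(value)
    return found


# Field 1 = varint 150, field 2 = the string "testing".
message = bytes.fromhex("089601120774657374696e67")
print(extract_fields(message, {1, 2}))
```

Fields that are not requested are skipped by length without being interpreted, which is what makes extracting several fields in one pass cheap.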
Bonus: SJK diagnostic tool
SJK – CLI tool exploiting JVM diagnostic interfaces
 • Connect to JVM by PID
 • Display thread CPU usage in real time (like top)
 • Display per thread memory allocation rate
 • Dead objects histogram
 • … and more
https://github.com/aragozin/jvm-tools
Thank you
http://blog.ragozin.info
- my articles
http://code.google.com/p/gridkit
http://github.com/gridkit
- my open source code
http://aragozin.timepad.ru
- tech meetups in Moscow

Alexey Ragozin
alexey.ragozin@gmail.com
