1
Hybrid Transactional/Analytics Processing
with Spark and In-Memory Data Grids
Copyright © GigaSpaces 2017. All rights reserved.
Ali Hodroj
VP, Products and Strategy @ahodroj
2
GigaSpaces
Ultra-Low Latency / High Throughput Middleware
Direct customers
500+
Headquarters
New York, NY
Established
2001
3
How we got HERE
4
We’re seeing more in our customer base
5
…a shift towards real-time
BI, Big Data, Fast Data
6
Sample Customer Use Cases
Internet of Things
Omni-Channel
Operational Intelligence
Operational Analytics
Predictive Analytics
Fraud Detection, Supply chain optimization
Personalization, Recommendation
Edge Analytics
Operational Intelligence, Predictive Maintenance, Spatial Analytics
7
In-Memory Computing
(not a new thing)
Rapid decline in RAM prices drives advanced data processing innovations
• Transactional (2001-present)
– In-Memory Databases
– In-Memory Data Grids
• Analytics (2012-present)
– In-Memory Data Processing Frameworks (Spark)
– In-Memory File Systems (Tachyon)
8
In-Memory Data Processing: Apache Spark
9
In-Memory Data Grid:
Online Transaction Processing at Low-Latency and High Throughput
Data Grid is a cluster of machines that work together to create a resilient shared data fabric for low-latency data access and extreme transaction processing
http://xap.github.io
10
In-Memory Data Grid 101
Feeder
Virtual Machine, Virtual Machine, Virtual Machine
Partitioned Data
11
In-Memory Data Grid 101: Execution Models
Write
Event-Driven / Reactive
RPC / Master-Worker
13
In-Memory Data Grid 101: Typical Deployment
[Deployment diagram. Labeled components: HTML and REST clients over HTTP/S; hardware load balancer (HW LB); two HTTPD load balancers, each with an LB agent and a GSA; processing units on GSA-managed hosts holding Primary Sets 1-6 and Backup Sets 1-6; Mirror Service (GSA) with asynchronous persistence to the DB; all inside a private or public cloud.]
14
Host: Cisco UCS Server
CPU: Intel 16-core 2.9 GHz
Concurrent Threads: 2
Throughput: 200, 400, 800 ops/sec
15
16
Hybrid Transactional/Analytics Processing at Scale
Requirements and challenges:
1. Bi-directional integration between transactional and analytical data stores
   – Provide closed-loop analytics pipeline. Data, insight, to action at sub-second latency
2. Ability to support POJO, JSON, GeoSpatial, and Unstructured types through a unified API
   – IoT and Omni-channel require the convergence of many different data types
3. Unified and scale-out real-time and historical data store
   – Blend of both real-time and historical data
17
Our road towards HTAP:
SPARK + MICROSERVICES
18
What’s needed
• Large-scale distributed analytics framework
• Unified, scale-out, low-latency data store
• Transactional capabilities: ACID, Event-Driven, Rich Data modeling
• Microservices
19
Our approach to HTAP
Low-latency Scale-Out In-Memory Data Grid + Large-scale distributed analytics framework
20
Why Spark?
• Unified & Concise API
• Highly Flexible Data Store Integration
• Massive Community and Adoption
21
1. Bi-directional integration between transactional and analytical data stores
   – Provide closed-loop analytics pipeline. Data, insight, to action at scale (sub-second latency)
22
23
[Architecture diagram]
Analytics Tier: Spark, Spark SQL, Spark Streaming, Machine Learning
Transactional Tier (ACID-compliant, Strong Consistency): In-Memory Data Grid
Storage: In-Memory Store (RAM); Flash, SSD, Off-Heap Store
Across both tiers: High availability, Security & Management
24
Build a connector: Spark to IMDG
• Get Partitions: An array of partitions that a dataset is divided into
  – IMDG Distributed Query to get partitions and their hosts
• Compute: A compute function to do a computation on partitions
  – Iterator over portion of data
• Get Preferred Location: Optional preferred locations, i.e. hosts for a partition where the data will be loaded
  – Hosts from Distributed Query
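The three hooks listed on the connector slide can be illustrated with a small plain-Python sketch. This is not Spark's or the grid's actual API: `GridRDD`, `GridPartition`, and the in-memory `grid` and `hosts` dictionaries are hypothetical stand-ins, assumed here only to show how a distributed query result could feed partitions, an iterator, and preferred hosts.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class GridPartition:
    """One grid partition, as a (hypothetical) distributed query might report it."""
    index: int
    host: str


class GridRDD:
    """Toy stand-in for an RDD backed by an in-memory data grid."""

    def __init__(self, grid: Dict[int, List[dict]], hosts: Dict[int, str]):
        self._grid = grid    # partition index -> rows held by that partition
        self._hosts = hosts  # partition index -> host name

    def get_partitions(self) -> List[GridPartition]:
        # "Get Partitions": ask the grid which partitions exist and where they live.
        return [GridPartition(i, self._hosts[i]) for i in sorted(self._grid)]

    def compute(self, partition: GridPartition) -> Iterator[dict]:
        # "Compute": iterate over the portion of data held by one partition.
        return iter(self._grid[partition.index])

    def get_preferred_locations(self, partition: GridPartition) -> List[str]:
        # "Get Preferred Location": the host reported by the distributed query.
        return [partition.host]


grid = {0: [{"id": 1}, {"id": 3}], 1: [{"id": 2}]}
hosts = {0: "node2", 1: "node3"}
rdd = GridRDD(grid, hosts)
for part in rdd.get_partitions():
    print(part.index, rdd.get_preferred_locations(part), list(rdd.compute(part)))
```

A scheduler that honours the preferred locations would run each task on the host that already holds the partition, which is the data-locality idea in Pattern #1.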
25
Pattern #1: Data Locality (machine-level)
• node 1: Spark master, Grid master
• node 2: Spark worker, Grid Partition
• node 3: Spark worker, Grid Partition
• NoSQL Storage
26
Pattern #2: Pushdown Predicates (Grid-side processing)
Implementing DataSource API
SELECT SUM(amount)
FROM order
WHERE city = 'NY' AND year > 2012
• Aggregation in Spark
• Filtering and column pruning in Data Grid
Spark SQL architecture:
• Pushing down predicates to Data Grid
• Leveraging indexes
• Transparent to user
• Enabling support for other languages - Python/R
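The split of work in the query above (filtering and column pruning on the grid side, aggregation on the Spark side) can be sketched in plain Python. The function names `grid_scan` and `sum_amount` and the sample `orders` data are assumptions made for illustration; this does not use Spark's DataSource API or any grid client.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def grid_scan(partition: Iterable[Row],
              predicate: Callable[[Row], bool],
              columns: List[str]) -> Iterator[Row]:
    """Grid side: apply the pushed-down predicate and return only the needed columns."""
    for row in partition:
        if predicate(row):
            yield {c: row[c] for c in columns}


def sum_amount(partitions: List[List[Row]], predicate: Callable[[Row], bool]) -> int:
    """Spark side: aggregate the already filtered, already pruned rows."""
    total = 0
    for partition in partitions:
        for row in grid_scan(partition, predicate, ["amount"]):
            total += row["amount"]
    return total


# Two hypothetical grid partitions holding order rows.
orders = [
    [{"city": "NY", "year": 2013, "amount": 10},
     {"city": "LA", "year": 2014, "amount": 7}],
    [{"city": "NY", "year": 2011, "amount": 5},
     {"city": "NY", "year": 2015, "amount": 20}],
]

# WHERE city = 'NY' AND year > 2012
total = sum_amount(orders, lambda r: r["city"] == "NY" and r["year"] > 2012)
print(total)  # 30
```

Only two single-column rows cross from the grid side to the aggregation step here, instead of four full rows, which is where the saving comes from.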
27
Pattern #3: Decouple Data Processing from Data Storage
• node 1: Spark master, Grid master
• node 2: Spark worker, Grid Partition
• node 3: Spark worker, Grid Partition
• Spark workers: lightweight workers, small JVMs
• Grid partitions: large JVMs, fast indexing
• NoSQL Storage
28
Push-down Predicates performance
• Traditional Spark filtering of 7MM records: 31 sec
• Grid-side + Spark filtering of 7MM records: 800 ms
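As a quick check of what the two figures above imply, the ratio can be worked out directly. The variable names are illustrative; the numbers are the ones on the slide.

```python
# Figures from the slide: 31 seconds vs 800 milliseconds for 7MM records.
traditional_s = 31.0
grid_side_s = 0.8

speedup = traditional_s / grid_side_s
print(f"{speedup:.2f}x")  # 38.75x
```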
29
2. Ability to support POJO, JSON, GeoSpatial, and Unstructured types through a unified API
   – IoT and Omni-channel require the convergence of many different data types
In-Memory Data Grid + Spark Convergence
Geo-Spatial Full Text
Simple K/V to RDD Mapping
POJO Domain Model to Spark
POJO Domain Model to Spark (Event-Driven)
JSON Domain Model to Spark
Geo-Spatial Data Frames
Geo-Spatial
Full Text Indexes + Lucene Analyzers
Full Text
37
3. Unified and scale-out real-time and historical data store
   – Blend of both real-time and historical data
38
In-Memory Data Grid Partitioning
hash(key) % #nodes
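The routing rule `hash(key) % #nodes` can be sketched as follows. The use of `zlib.crc32` is an assumption made for this example, chosen because Python's built-in `hash()` for strings varies between processes; the slides do not say which hash function the grid uses.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def partition_for(key: str, num_nodes: int) -> int:
    """Route a key to a partition: hash(key) % #nodes."""
    # crc32 is a stand-in hash (assumption); it is stable across runs.
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def place(keys: List[str], num_nodes: int) -> Dict[int, List[str]]:
    """Group keys by the partition they route to."""
    placement: Dict[int, List[str]] = defaultdict(list)
    for key in keys:
        placement[partition_for(key, num_nodes)].append(key)
    return dict(placement)


placement = place([f"order-{i}" for i in range(10)], num_nodes=3)
for node, keys in sorted(placement.items()):
    print(node, keys)
```

Because the same key always yields the same partition index, any client can route a read or write without consulting a central directory.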
39
In-Memory Data Grid Partitioning – With HA
hash(key) % #nodes
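A minimal sketch of partitioning with high availability, under stated assumptions: each partition keeps one backup copy on the next node, `(partition + 1) % #nodes`, and `zlib.crc32` stands in for the hash. The `HaGrid` class and this backup placement rule are hypothetical and are not taken from the slides or from the product.

```python
import zlib
from typing import Dict, List, Set


class HaGrid:
    """Toy grid: one primary and one backup copy per partition, on different nodes."""

    def __init__(self, num_nodes: int):
        if num_nodes < 2:
            raise ValueError("need at least two nodes to keep a backup elsewhere")
        self.num_nodes = num_nodes
        self.primary: List[Dict[str, object]] = [{} for _ in range(num_nodes)]
        self.backup: List[Dict[str, object]] = [{} for _ in range(num_nodes)]
        self.failed: Set[int] = set()

    def _partition(self, key: str) -> int:
        return zlib.crc32(key.encode("utf-8")) % self.num_nodes

    def _backup_node(self, partition: int) -> int:
        # Assumed placement rule: the backup lives on the next node.
        return (partition + 1) % self.num_nodes

    def write(self, key: str, value: object) -> None:
        p = self._partition(key)
        self.primary[p][key] = value
        self.backup[self._backup_node(p)][key] = value  # replicate to the backup

    def fail(self, node: int) -> None:
        # Losing a node loses both the primary and the backup data it hosted.
        self.failed.add(node)
        self.primary[node].clear()
        self.backup[node].clear()

    def read(self, key: str) -> object:
        p = self._partition(key)
        if p not in self.failed:
            return self.primary[p][key]
        return self.backup[self._backup_node(p)][key]  # fall back to the backup


ha_grid = HaGrid(3)
for i in range(20):
    ha_grid.write(f"k{i}", i)
ha_grid.fail(0)
print(all(ha_grid.read(f"k{i}") == i for i in range(20)))  # True
```

The sketch survives any single node failure; losing a node together with the node holding its backup would lose data, which is why the two copies are kept on different machines.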
40
Spark to Data Grid Partition Cardinality
• node 1: Spark executor, Spark Partition #1, Grid Partition #1
• node 2: Spark executor, Spark Partition #2, Grid Partition #2
• node 3: Spark executor, Spark Partition #3, Grid Partition #3
Direct connection: simple, but not enough parallelism for Spark
41
Pattern #4: Grid bucketing for higher throughput
[Diagram: on node 1, a Spark Executor reads Grid Primary #1, whose data is divided into buckets 0 to 1023; Spark Partition #1 and Spark Partition #2 each cover a subset of those buckets.]
• 1 Spark partition = M grid buckets
• 1 Grid partition = N Spark partitions
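The bucketing relationship (1 Spark partition = M grid buckets, 1 grid partition = N Spark partitions) can be sketched as a split of each grid partition's 1024 buckets into N contiguous ranges. The function `spark_splits` and the choice of contiguous ranges are assumptions for illustration; the slides only give the bucket count and the two ratios.

```python
from typing import List, Tuple

BUCKETS_PER_GRID_PARTITION = 1024  # buckets 0..1023, as shown on the slide


def spark_splits(grid_partitions: int, spark_per_grid: int) -> List[Tuple[int, range]]:
    """Return one (grid_partition, bucket_range) pair per Spark partition."""
    splits = []
    for g in range(grid_partitions):
        for s in range(spark_per_grid):
            # Integer arithmetic keeps the ranges contiguous and gap-free
            # even when 1024 is not divisible by spark_per_grid.
            start = s * BUCKETS_PER_GRID_PARTITION // spark_per_grid
            end = (s + 1) * BUCKETS_PER_GRID_PARTITION // spark_per_grid
            splits.append((g, range(start, end)))
    return splits


splits = spark_splits(grid_partitions=3, spark_per_grid=4)
print(len(splits))  # 12
print(splits[0])    # (0, range(0, 256))
```

With N = 4, three grid partitions yield twelve Spark partitions of M = 256 buckets each, so Spark parallelism is no longer capped at the number of grid partitions.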
42
Eventually, we productized this as an open source Spark distribution
GigaSpaces InsightEdge
http://insightedge.io
@InsightEdgeIO
Apache 2 License
http://insightedge.io/docs
http://insightedge.io/blog
http://github.com/InsightEdge
High Performance Spark with OLTP Capabilities
Upcoming: Spark RDD/DF native read/save on Off-Heap (SSD/Flash/Direct Buffers)
[Diagram: application processing and Spark workers on top of the In-Memory Data Grid; primary instances and backup instances linked by sync replication, each backed by a storage array.]
• Significant RAM TCO reduction in Spark clusters
• Direct RDD/DataFrame read/write from SSD/Flash device
• Avoid filesystem hops and write amplification
46
REFERENCE
ARCHITECTURES
47
In-Process HTAP
48
Point-of-Decision HTAP
XAP + InsightEdge deployed on different grid clusters with bi-directional real-time data replication
[Diagram: Transactions and Analytics clusters on the In-Memory Data Grid, linked by Realtime Replication]
• Scoring models
• Trigger actions
• Events
49
Transportation / IoT: Connected Cars / Fleet Geo-Analytics
Challenge
• Stream data from 1,000s of Taxis
• Actively monitor and generate real-time notifications
• Real-time Route Optimization and Geo-Fencing
Solution
• Leverage unified in-memory data fabric as middleware for geo-spatial analytics
• Elastically scale stream processing and transactional apps together
• Location-based tracking, Geo-fencing
Edge components
Data Sources
50
THANK YOU!
QUESTIONS?


Spark DC Interactive Meetup: HTAP with Spark and In-Memory Data Grids
