© 2018 GridGain Systems, Inc.
Improving Apache Spark™ In-Memory Computing with Apache Ignite™
Valentin Kulichenko
GridGain Systems
Apache Ignite: a memory-centric distributed database, caching, and processing platform
for transactional, analytical, and streaming workloads,
delivering in-memory speeds at petabyte scale
Apache Ignite Database and Caching Platform
Memory-Centric Storage
Ignite Native Persistence (Flash, SSD, Intel 3D XPoint)
Third-Party Persistence (RDBMS, HDFS, NoSQL)
SQL, Transactions, Compute, Services, ML, Streaming, Key/Value
IoT, Financial Services, Pharma & Healthcare, E-Commerce, Travel & Logistics, Telco
Apache Ignite:
• Distributed memory-centric database
• Fully fledged compute platform: SQL, transactions, key-value, collocated processing, ML/DL
• OLAP and OLTP
Apache Spark:
• Ingests data from HDFS or other storage
• Streaming and compute engine
• Inclined towards OLAP and focused on MapReduce (MR) payloads
Comparing Ignite and Spark
Ignite is a memory-centric store for Spark
• No data movement from Ignite to Spark
• In-place query execution
• Boost DataFrame and SQL performance
• Share state and data among Spark jobs
• Faster data and streaming analytics
Ignite and Spark Together
Ignite and Spark Integration
[Diagram: a Spark application runs Spark jobs on three Spark Workers, with Yarn, Mesos, Docker, and HDFS shown alongside; the jobs share an In-Memory Shared RDD or DataFrame backed by three GridGain nodes.]
• Share state and data among Spark jobs
• No data movement
• Boost DataFrame and SQL performance
• SQL on top of RDDs
• In-place query execution
• Spark RDD abstraction
• Shared view over Ignite cache/table
• Mutable
• Ignite SQL on top of RDD APIs
• Indexes and in-place execution
Ignite Shared RDDs
• Standard RDD APIs + Ignite SQL
• No rip-and-replace
• Switch to Ignite as a storage
Write to and Read from Ignite
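import org.apache.ignite.spark.IgniteRDD
// ic is an IgniteContext, sc is the SparkContext
val sharedRDD: IgniteRDD[Int, Int] = ic.fromCache("sharedRDD")
// Write: store 100,000 (i, i) pairs in the Ignite cache
sharedRDD.savePairs(sc.parallelize(1 to 100000, 10).map(i => (i, i)))
// Read: filter with the standard RDD API, or run Ignite SQL over the same cache
val greaterThanFiftyThousand = sharedRDD.filter(_._2 > 50000)
val df = sharedRDD.sql("select _val from Integer where _key > 50000")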
• Optimizing Spark’s Catalyst Engine
• In-place execution on Ignite side
• No data movement
• For most of the scenarios
Ignite DataFrames
1. Initial Query
2. Query execution over local data
3. Reduce multiple results in one
Ignite Node: Canada (Toronto, Ottawa, Montreal, Calgary)
Ignite Node: India (Mumbai, New Delhi)
SQL Queries Execution Flow
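The three steps listed above follow a scatter-gather pattern. The following is a minimal plain-Java sketch of that flow, not Ignite code: `QueryFlowSketch`, `Node`, `executeLocally`, and `query` are hypothetical names, the "query" is assumed to be a simple filter on city names, and the two nodes hold the cities shown on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

/** Plain-Java illustration of the three-step query flow; not Ignite code. */
final class QueryFlowSketch {

    /** Hypothetical stand-in for a node that holds one partition of the data. */
    static final class Node {
        private final List<String> localCities;

        Node(List<String> localCities) {
            this.localCities = localCities;
        }

        /** Step 2: run the query over this node's local data only. */
        List<String> executeLocally(Predicate<String> filter) {
            List<String> partial = new ArrayList<>();
            for (String city : localCities) {
                if (filter.test(city)) {
                    partial.add(city);
                }
            }
            return partial;
        }
    }

    /** Step 1 sends the query to every node; step 3 reduces the partial results into one. */
    static List<String> query(List<Node> nodes, Predicate<String> filter) {
        List<String> merged = new ArrayList<>();
        for (Node node : nodes) {
            merged.addAll(node.executeLocally(filter));
        }
        Collections.sort(merged); // give the reduced result a deterministic order
        return merged;
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(
            new Node(Arrays.asList("Toronto", "Ottawa", "Montreal", "Calgary")),
            new Node(Arrays.asList("Mumbai", "New Delhi")));
        // Cities starting with "M" come from both nodes: prints [Montreal, Mumbai]
        System.out.println(query(nodes, city -> city.startsWith("M")));
    }
}
```

Only the filtered partial results cross node boundaries in this sketch, which is the point of executing over local data before reducing.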
• Store DataFrames in Ignite
• Save modes
  • Append
  • Overwrite
  • ErrorIfExists
  • Ignore
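import org.apache.ignite.spark.IgniteDataFrameSettings;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

SparkSession spark = ...; // an existing SparkSession
String cfgPath = "path/to/config/file";
Dataset<Row> jsonDataFrame = spark.read().json("path/to/file.json");
jsonDataFrame.write()
    .format(IgniteDataFrameSettings.FORMAT_IGNITE()) // Ignite data source
    .mode(SaveMode.Append) // SaveMode
    //... other options
    .save();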
Saving DataFrames
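The four save modes differ only in what happens when the target table already exists. The following plain-Java sketch models those semantics as Spark documents them for `SaveMode`; it uses a hypothetical in-memory map in place of Ignite tables, and `SaveModeSketch`, `Mode`, `save`, and `rows` are illustrative names rather than Ignite or Spark APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the four save modes; an in-memory map stands in for Ignite tables. */
final class SaveModeSketch {

    /** Mirrors the save modes listed on the slide. */
    enum Mode { APPEND, OVERWRITE, ERROR_IF_EXISTS, IGNORE }

    private final Map<String, List<String>> tables = new HashMap<>();

    /** Saves rows into the named table according to the given mode. */
    void save(String table, List<String> rows, Mode mode) {
        boolean exists = tables.containsKey(table);
        if (exists && mode == Mode.ERROR_IF_EXISTS) {
            throw new IllegalStateException("Table already exists: " + table);
        }
        if (exists && mode == Mode.IGNORE) {
            return; // leave the existing table untouched
        }
        if (!exists || mode == Mode.OVERWRITE) {
            tables.put(table, new ArrayList<>(rows)); // create the table or replace its contents
        } else {
            tables.get(table).addAll(rows); // APPEND to the existing rows
        }
    }

    /** Returns the rows of the named table, or null if it was never saved. */
    List<String> rows(String table) {
        return tables.get(table);
    }

    public static void main(String[] args) {
        SaveModeSketch store = new SaveModeSketch();
        List<String> batch = new ArrayList<>();
        batch.add("row-1");
        store.save("person", batch, Mode.ERROR_IF_EXISTS); // table does not exist yet, so it is created
        store.save("person", batch, Mode.APPEND);          // table now holds two rows
        store.save("person", batch, Mode.IGNORE);          // no effect, table already exists
        store.save("person", batch, Mode.OVERWRITE);       // table is replaced with one row
        System.out.println(store.rows("person").size());
    }
}
```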
• Read from Ignite
• Specify format
• Specify config file
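import org.apache.ignite.spark.IgniteDataFrameSettings;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = ...; // an existing SparkSession
String cfgPath = "path/to/config/file";
Dataset<Row> df = spark.read()
    .format(IgniteDataFrameSettings.FORMAT_IGNITE()) // Data source
    .option(IgniteDataFrameSettings.OPTION_TABLE(), "person") // Table to read
    .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), cfgPath) // Ignite config
    .load();
// Register the DataFrame as a view so it can be queried with Spark SQL
df.createOrReplaceTempView("person");
Dataset<Row> igniteDF = spark.sql(
    "SELECT * FROM person WHERE name = 'Mary Major'");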
Reading DataFrames
• 1 Ignite Server Node
• SensorDataGenerator
  • Writes random data to a socket
• Stream
  • Connects to the socket, reads sensor data, and streams it via Spark; for each streamed RDD, it creates a DataFrame and saves it into Ignite
• Query
  • Creates another Spark application that uses the DataFrames integration to query data from Ignite
DataFrames Demo Setup
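The slides do not show the demo's SensorDataGenerator, so the following is only a rough plain-Java sketch of what a generator that writes random data to a socket might look like. The `sensorId,value` line format, the port 9999, the value ranges, and every class and method name here are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Writer;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.Random;

/** Hypothetical sensor data generator; the record format is an assumption. */
final class SensorDataGeneratorSketch {

    /** Writes `count` random readings, one "sensorId,value" line each, to the given writer. */
    static void writeReadings(Writer out, int count, Random random) {
        PrintWriter writer = new PrintWriter(out);
        for (int i = 0; i < count; i++) {
            int sensorId = random.nextInt(10);           // assumed: ten sensors, ids 0 to 9
            double value = random.nextDouble() * 100.0;  // assumed: readings in [0, 100)
            writer.println(sensorId + "," + value);
        }
        writer.flush(); // the caller owns the underlying writer and closes it
    }

    public static void main(String[] args) throws IOException {
        // Wait for one client (for example, the streaming job) and send it 100 readings.
        try (ServerSocket server = new ServerSocket(9999);
             Socket client = server.accept()) {
            Writer out = new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8);
            writeReadings(out, 100, new Random());
        }
    }
}
```

Keeping the generation logic in `writeReadings` separate from the socket handling means the same code can write to any `Writer`, which makes it easy to check without opening a network connection.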
Any Questions?
Thank you for joining us. Follow the conversation.
http://ignite.apache.org
#apacheignite
