Improving Apache Spark™ In-Memory Computing with Apache Ignite™

© 2018 GridGain Systems, Inc.
Valentin Kulichenko
GridGain Systems

Apache Ignite is a memory-centric distributed database, caching, and processing platform for transactional, analytical, and streaming workloads, delivering in-memory speeds at petabyte scale.

Apache Ignite Database and Caching Platform
[Diagram: layered platform architecture]
APIs: SQL, Transactions, Compute, Services, ML, Streaming, Key/Value
Storage: Memory-Centric Storage
Persistence: Ignite Native Persistence (Flash, SSD, Intel 3D XPoint) or Third-Party Persistence (RDBMS, HDFS, NoSQL)
Industries: IoT, Financial Services, Pharma & Healthcare, E-Commerce, Travel & Logistics, Telco

Ignite
• Distributed memory-centric database
• Fully fledged compute platform: SQL, transactions, key-value, collocated processing, ML/DL
• OLAP and OLTP
Spark
• Ingests data from HDFS or another storage
• Streaming and compute engine
• Inclined towards OLAP and focused on MR (MapReduce) payloads
Comparing Ignite and Spark

Ignite is a memory-centric store for Spark
• No data movement from Ignite to Spark
• In-place query execution
• Boost DataFrame and SQL performance
• Share state and data among Spark jobs
• Faster data and streaming analytics
Ignite and Spark Together

Ignite and Spark Integration
[Diagram: a Spark application runs Spark jobs on three Spark workers (deployment options shown: Yarn, Mesos, Docker, HDFS). All jobs work against an in-memory shared RDD or DataFrame layer hosted on three GridGain nodes.]
Callouts:
• Share state and data among Spark jobs
• No data movement
• Boost DataFrame and SQL performance
• SQL on top of RDDs
• In-place query execution

• Spark RDD abstraction
• Shared view over Ignite cache/table
• Mutable
• Ignite SQL on top of the RDD APIs
• Indexes and in-place execution
Ignite Shared RDDs

• Standard RDD APIs + Ignite SQL
• No rip-and-replace
• Switch to Ignite as a storage
Write to and Read from Ignite
// Write: store 100,000 (i, i) pairs in the Ignite cache "sharedRDD"
val sharedRDD: IgniteRDD[Int, Int] = ic.fromCache("sharedRDD")
sharedRDD.savePairs(sc.parallelize(1 to 100000, 10).map(i => (i, i)))
// Read: query the same cache through the standard RDD API and through Ignite SQL
val greaterThanFiftyThousand = sharedRDD.filter(_._2 > 50000)
val df = sharedRDD.sql("select _val from Integer where _key > 50000")

• Optimizing Spark’s Catalyst Engine
• In-place execution on Ignite side
• No data movement
• Applies to most scenarios
Ignite DataFrames

1. Initial Query
2. Query execution over local data
3. Reduce multiple results into one
[Diagram: two Ignite nodes. One holds Canada (Toronto, Ottawa, Montreal, Calgary), the other holds India (Mumbai, New Delhi). Arrows numbered 1 to 3 correspond to the steps listed.]
SQL Queries Execution Flow
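The three steps on this slide follow a scatter-and-gather pattern, which can be sketched in plain Java. This is only a toy model under stated assumptions: it uses no Ignite or Spark API, the class and method names (`ScatterGatherSketch`, `executeLocally`, `query`) are hypothetical, and each "node" is just an in-memory map holding the cities from the diagram.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class ScatterGatherSketch {
    // Each "node" holds only its own partition of the data (country -> cities).
    public static final List<Map<String, List<String>>> NODES = List.of(
        Map.of("Canada", List.of("Toronto", "Ottawa", "Montreal", "Calgary")),
        Map.of("India", List.of("Mumbai", "New Delhi")));

    // Step 2: evaluate the filter against one node's local data only.
    public static List<String> executeLocally(Map<String, List<String>> node,
                                              Predicate<String> filter) {
        List<String> local = new ArrayList<>();
        for (List<String> cities : node.values()) {
            for (String city : cities) {
                if (filter.test(city)) {
                    local.add(city);
                }
            }
        }
        return local;
    }

    // Step 1: send the query to every node. Step 3: merge the partial results.
    public static List<String> query(Predicate<String> filter) {
        List<String> merged = new ArrayList<>();
        for (Map<String, List<String>> node : NODES) {
            merged.addAll(executeLocally(node, filter));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Prints [Montreal, Mumbai]
        System.out.println(query(city -> city.startsWith("M")));
    }
}
```

Only the matching rows leave each node, which is the point of in-place execution: the filter travels to the data rather than the data to the filter.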

• Store DataFrames in Ignite
• Save modes
• Append
• Overwrite
• ErrorIfExists
• Ignore
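SparkSession spark = _ // placeholder: an existing SparkSession
String cfgPath = "path/to/config/file";
Dataset<Row> jsonDataFrame = spark.read().json("path/to/file.json");
jsonDataFrame.write()
    .format(IgniteDataFrameSettings.FORMAT_IGNITE()) // Data source
    .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), cfgPath) // Ignite config
    .mode(SaveMode.Append) // SaveMode
    //... other options
    .save();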
Saving DataFrames
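How the four save modes differ when the target table already exists can be illustrated with a small plain-Java model. This is a sketch under assumptions, not Spark's or Ignite's implementation: the `Mode` enum and the `save` helper are hypothetical names, and a "table" is just a list of strings in a map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SaveModeSketch {
    // Mirrors the four modes listed on the slide; this is not Spark's own enum.
    public enum Mode { APPEND, OVERWRITE, ERROR_IF_EXISTS, IGNORE }

    // Toy stand-in for a save: `tables` maps a table name to its rows.
    public static void save(Map<String, List<String>> tables, String table,
                            List<String> rows, Mode mode) {
        List<String> existing = tables.get(table);
        if (existing == null) {
            // No table yet: every mode simply creates it.
            tables.put(table, new ArrayList<>(rows));
            return;
        }
        switch (mode) {
            case APPEND:
                existing.addAll(rows); // keep old rows, add new ones
                break;
            case OVERWRITE:
                tables.put(table, new ArrayList<>(rows)); // replace old rows
                break;
            case ERROR_IF_EXISTS:
                throw new IllegalStateException("Table already exists: " + table);
            case IGNORE:
                break; // leave existing data untouched
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> tables = new HashMap<>();
        save(tables, "person", List.of("row1"), Mode.ERROR_IF_EXISTS); // creates the table
        save(tables, "person", List.of("row2"), Mode.APPEND);
        save(tables, "person", List.of("row3"), Mode.IGNORE);
        // Prints [row1, row2]
        System.out.println(tables.get("person"));
    }
}
```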

• Read from Ignite
• Specify format
• Specify config file
SparkSession spark = _ // placeholder: an existing SparkSession
String cfgPath = "path/to/config/file";
Dataset<Row> df = spark.read()
    .format(IgniteDataFrameSettings.FORMAT_IGNITE()) // Data source
    .option(IgniteDataFrameSettings.OPTION_TABLE(), "person") // Table to read
    .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), cfgPath) // Ignite config
    .load();
df.createOrReplaceTempView("person");
Dataset<Row> igniteDF = spark.sql(
    "SELECT * FROM person WHERE name = 'Mary Major'");
Reading DataFrames

• 1 Ignite Server Node
• SensorDataGenerator
• Writes random data to a socket
• Stream
• Connects to the socket, reads sensor data and
streams via Spark; for each streamed RDD, it
creates a DataFrame and saves it into Ignite
• Query
• Creates another Spark application that uses
DataFrames integration to query data from Ignite
DataFrames Demo Setup
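The generator side of this setup might look roughly like the following plain-Java sketch. The demo's actual SensorDataGenerator is not shown on the slides, so everything here is an assumption: the class name `SensorDataGeneratorSketch`, the `writeReadings` helper, and the `sensorId,temperature` line format are all hypothetical. The sketch writes to any `Writer` so it can run without a network.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Locale;
import java.util.Random;

public class SensorDataGeneratorSketch {
    // Writes `count` lines of the assumed form "sensorId,temperature" to `out`.
    public static void writeReadings(Writer out, int count, Random random) {
        PrintWriter writer = new PrintWriter(out);
        for (int i = 0; i < count; i++) {
            int sensorId = random.nextInt(10);                      // ids 0..9
            double temperature = 20.0 + random.nextDouble() * 10.0; // 20.0 to 30.0
            // Locale.ROOT keeps the decimal separator a dot on every machine.
            writer.printf(Locale.ROOT, "%d,%.2f%n", sensorId, temperature);
        }
        writer.flush();
    }

    public static void main(String[] args) {
        // Writes to an in-memory buffer here; values are random.
        StringWriter buffer = new StringWriter();
        writeReadings(buffer, 3, new Random());
        System.out.print(buffer);
    }
}
```

To write to a socket as the slide describes, `out` could be an `OutputStreamWriter` wrapped around the stream returned by `Socket.getOutputStream()`; the streaming application would then read the same lines from the other end.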

Any Questions?
Thank you for joining us. Follow the conversation.
http://ignite.apache.org
#apacheignite
