Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Chaudhri at Big Data Spain 2017

Akmal Chaudhri, GridGain Systems
Boost Hadoop and Spark with in-memory technologies

Agenda
• Introduction to Apache Ignite
• Hadoop Acceleration
• Spark Acceleration
• Demos
• Q&A
Big Data Spain 2017

Apache Ignite in one slide
• Memory-centric platform
– that is strongly consistent
– and highly available
– with powerful SQL
– key-value and processing APIs
• Designed for
– Performance
– Scalability

Apache Ignite
• Data source agnostic
• Fully fledged compute engine and durable storage
• OLAP and OLTP
• Fully ACID transactions across memory and disk
• In-memory SQL support
• Early ML libraries
• Growing community

Hadoop Acceleration
• In-memory Hadoop Execution
• Alternative job tracker
– Faster MapReduce
• Built on Ignite File System (IGFS)
• Secondary File System
– Read-through and Write-through
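
The read-through and write-through behaviour named above can be illustrated with a small conceptual model. This is only a sketch in plain Scala, not the IGFS implementation: `SecondaryStore` and `InMemoryLayer` are hypothetical names invented here, with `SecondaryStore` standing in for a slower secondary file system such as HDFS.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a slower secondary file system such as HDFS.
class SecondaryStore {
  val files = mutable.Map.empty[String, String]
  var reads = 0 // counts how often the slow store is actually read

  def read(path: String): Option[String] = {
    reads += 1
    files.get(path)
  }

  def write(path: String, data: String): Unit = files.update(path, data)
}

// Hypothetical in-memory layer placed in front of the secondary store.
class InMemoryLayer(secondary: SecondaryStore) {
  private val memory = mutable.Map.empty[String, String]

  // Read-through: a miss is served from the secondary store and then kept in memory.
  def read(path: String): Option[String] =
    memory.get(path).orElse {
      val loaded = secondary.read(path)
      loaded.foreach(data => memory.update(path, data))
      loaded
    }

  // Write-through: every write goes to memory and to the secondary store.
  def write(path: String, data: String): Unit = {
    memory.update(path, data)
    secondary.write(path, data)
  }
}

object ReadWriteThroughDemo {
  def main(args: Array[String]): Unit = {
    val store = new SecondaryStore
    store.write("/data/a.txt", "already on disk")
    val layer = new InMemoryLayer(store)

    layer.read("/data/a.txt") // miss: loaded from the secondary store
    layer.read("/data/a.txt") // hit: served from memory
    layer.write("/data/b.txt", "new file")

    println(s"secondary reads: ${store.reads}")
    println(s"b.txt persisted: ${store.files.contains("/data/b.txt")}")
  }
}
```

In this model the second read of the same path never touches the secondary store, which is the effect the in-memory layer is meant to provide.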

Ignite In-Memory File System
• Distributed in-memory file system
• Implements HDFS API
• Can be transparently plugged into Hadoop or Spark deployments

MapReduce
• Parallelize processing of data in HDFS
• Eliminate Hadoop JobTracker and TaskTracker overhead
• Low-latency distributed processing
• Minimal configuration change
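
As a rough illustration of what the "minimal configuration change" might involve, the sketch below prints the Hadoop properties that, to the best of my knowledge, the Ignite Hadoop accelerator documentation uses to register IGFS and to route MapReduce jobs to Ignite. The property names and class names should be checked against the documentation for your Ignite version. The address `localhost:11211` is an assumption for a single local node, and `IgniteHadoopSettings` with its `toXml` helper is a hypothetical name used only here.

```scala
object IgniteHadoopSettings {
  // Assumed core-site.xml entries that register the IGFS file system implementations.
  val fileSystemProps: Seq[(String, String)] = Seq(
    "fs.igfs.impl" -> "org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem",
    "fs.AbstractFileSystem.igfs.impl" -> "org.apache.ignite.hadoop.fs.v2.IgniteHadoopFileSystem"
  )

  // Assumed mapred-site.xml entries that send jobs to Ignite instead of the
  // standard Hadoop job tracker. The host and port are placeholders.
  val mapReduceProps: Seq[(String, String)] = Seq(
    "mapreduce.framework.name" -> "ignite",
    "mapreduce.jobtracker.address" -> "localhost:11211"
  )

  // Renders name/value pairs as a Hadoop <configuration> document.
  def toXml(props: Seq[(String, String)]): String =
    props.map { case (name, value) =>
      s"  <property>\n    <name>$name</name>\n    <value>$value</value>\n  </property>"
    }.mkString("<configuration>\n", "\n", "\n</configuration>")

  def main(args: Array[String]): Unit = {
    println("core-site.xml:\n" + toXml(fileSystemProps))
    println("mapred-site.xml:\n" + toXml(mapReduceProps))
  }
}
```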

Spark Acceleration
• Long running applications
– Passing state between jobs
• Disk File System
– Convert RDDs to disk files and back
• Share RDDs in-memory
– Native Spark API
– Native Spark transformations

Ignite for Spark
• Spark RDD abstraction
• Shared in-memory view on data across different Spark jobs, workers or applications
• Implemented as a view over a distributed Ignite cache

IgniteContext
• Main entry-point to Spark-Ignite integration
• SparkContext plus either one of
– IgniteConfiguration()
– Path to XML configuration file
• Optional Boolean client argument
– true => Shared deployment
– false => Embedded deployment
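
A minimal sketch of how the optional Boolean might be passed, assuming the `ignite-spark` module is on the classpath and the job is launched through `spark-submit`. The flag is given positionally because the parameter name may differ between Ignite releases. `IgniteContextModes`, `deploymentName` and the `"embedded"` program argument are hypothetical names chosen for this illustration. With the flag set to `true`, an Ignite cluster is expected to be running already.

```scala
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

object IgniteContextModes {
  // Maps the Boolean flag to the deployment names used on the slide.
  def deploymentName(flag: Boolean): String =
    if (flag) "shared" else "embedded"

  def main(args: Array[String]): Unit = {
    // Hypothetical convention: pass "embedded" as the first program argument
    // to request an embedded deployment; anything else means shared.
    val flag = !args.headOption.contains("embedded")

    val sc = new SparkContext(new SparkConf().setAppName("IgniteContextModes"))

    // Third argument: true => shared deployment, false => embedded deployment.
    val ic = new IgniteContext(sc, () => new IgniteConfiguration(), flag)

    println(s"Started IgniteContext for a ${deploymentName(flag)} deployment")
    sc.stop()
  }
}
```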

IgniteContext examples
val igniteContext = new IgniteContext(sparkContext,
    () => new IgniteConfiguration())
val igniteContext = new IgniteContext(sparkContext,
    "examples/config/spark/example-shared-rdd.xml")

IgniteRDD
• Implementation of Spark RDD representing a live view of an Ignite cache
• Mutable (unlike native RDDs)
– All changes in Ignite cache will be visible to RDD users immediately
• Provides partitioning information to Spark executor
• Provides affinity information to Spark so that RDD computations can use data locality

Write to Ignite
• Ignite caches operate on key-value pairs
• Spark tuple RDD for key-value pairs and savePairs method
– Uses RDD partitioning to store values in parallel where possible
• Value-only RDD and saveValues method
– IgniteRDD generates a unique affinity-local key for each value stored into the cache
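
The value-only path could look roughly like the following sketch, assuming the `ignite-spark` module is on the classpath, the job is launched through `spark-submit`, and an Ignite cluster is reachable. The cache name `"words"`, the sample data and the object name `SaveValuesSketch` are hypothetical. The key type is declared as `IgniteUuid` on the assumption that this is the type of the generated keys.

```scala
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.lang.IgniteUuid
import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
import org.apache.spark.{SparkConf, SparkContext}

object SaveValuesSketch {
  // Hypothetical sample data: values only, no keys.
  val sampleValues: Seq[String] = Seq("alpha", "beta", "gamma")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SaveValuesSketch"))
    val ic = new IgniteContext(sc, () => new IgniteConfiguration())

    // Keys are generated by IgniteRDD, so only the value type matters to the caller.
    val words: IgniteRDD[IgniteUuid, String] = ic.fromCache("words")

    // Each value is stored under a generated key.
    words.saveValues(sc.parallelize(sampleValues))

    // The count reflects everything in the cache, including earlier entries.
    println("Entries in cache: " + words.count())
    sc.stop()
  }
}
```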

Write code example
val conf = new SparkConf().setAppName("SparkIgniteWriter")
val sc = new SparkContext(conf)
val ic = new IgniteContext(sc,
val sharedRDD: IgniteRDD[Int, Int] = ic.fromCache("sharedRDD")
sharedRDD.savePairs(sc.parallelize(1 to 100000, 10)
    .map(i => (i, i)))

Read from Ignite
• IgniteRDD is a live view of an Ignite cache
– No need to explicitly load data to Spark application from Ignite
– All RDD methods are available to use right away after an instance of IgniteRDD is created

Read code example
val conf = new SparkConf().setAppName("SparkIgniteReader")
val sc = new SparkContext(conf)
val ic = new IgniteContext(sc,
val sharedRDD: IgniteRDD[Int, Int] = ic.fromCache("sharedRDD")
val greaterThanFiftyThousand = sharedRDD.filter(_._2 > 50000)
println("The count is " + greaterThanFiftyThousand.count())

Any Questions?
Thank you for joining us. Follow the conversation.
http://ignite.apache.org
