Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Chaudhri at Big Data Spain 2017
Akmal Chaudhri, GridGain Systems
Boost Hadoop and Spark with in-memory technologies
Agenda
• Introduction to Apache Ignite
• Hadoop Acceleration
• Spark Acceleration
• Demos
• Q&A
Big Data Spain 2017
Apache Ignite in one slide
• Memory-centric platform
– that is strongly consistent
– and highly-available
– with powerful SQL
– key-value and processing APIs
• Designed for
– Performance
– Scalability
Apache Ignite
• Data source agnostic
• Fully fledged compute engine and durable storage
• OLAP and OLTP
• Fully ACID transactions across memory and disk
• In-memory SQL support
• Early ML libraries
• Growing community
Hadoop Acceleration
• In-memory Hadoop Execution
• Alternative job tracker
– Faster MapReduce
• Built on Ignite File System (IGFS)
• Secondary File System
– Read-through and Write-through
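To make the read-through and write-through behaviour concrete, here is a minimal plain-Scala sketch. It is not the IGFS implementation: `SecondaryStore` and `InMemoryLayer` are hypothetical names used only for illustration, and the "files" are simple strings.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the slower secondary file system (for example HDFS).
class SecondaryStore {
  private val files = mutable.Map.empty[String, String]
  var reads = 0 // counts how often the secondary store is actually hit

  def read(path: String): Option[String] = { reads += 1; files.get(path) }
  def write(path: String, data: String): Unit = files(path) = data
}

// Hypothetical in-memory layer placed in front of the secondary store.
class InMemoryLayer(secondary: SecondaryStore) {
  private val cache = mutable.Map.empty[String, String]

  // Read-through: serve from memory; on a miss, load from the
  // secondary store and keep a copy in memory for later reads.
  def read(path: String): Option[String] =
    cache.get(path).orElse {
      val loaded = secondary.read(path)
      loaded.foreach(data => cache(path) = data)
      loaded
    }

  // Write-through: update memory and propagate the write
  // to the secondary store in the same call.
  def write(path: String, data: String): Unit = {
    cache(path) = data
    secondary.write(path, data)
  }
}
```

In this sketch a second read of the same path is served from memory without touching the secondary store, while every write reaches both layers.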
Ignite In-Memory File System
• Distributed in-memory file system
• Implements HDFS API
• Can be transparently plugged into Hadoop or Spark deployments
MapReduce
• Parallelize processing of data in HDFS
• Eliminate Hadoop JobTracker and TaskTracker overhead
• Low-latency distributed processing
• Minimal configuration change
Spark Acceleration
• Long running applications
– Passing state between jobs
• Disk File System
– Convert RDDs to disk files and back
• Share RDDs in-memory
– Native Spark API
– Native Spark transformations
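For contrast, the disk-based hand-off mentioned above can be sketched with Spark's own `saveAsObjectFile` and `objectFile` calls. This is an illustration under assumptions, not code from the talk: the object name `DiskHandoff`, the temporary path and the `local[2]` master are all hypothetical, and in practice the write and the read would happen in two separate applications.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object DiskHandoff {
  // Serializes an RDD to disk, reads it back as a new RDD,
  // and returns the number of restored pairs.
  def roundTrip(sc: SparkContext, path: String): Long = {
    // "Job 1": write the state to the file system
    sc.parallelize(1 to 1000, 4).map(i => (i, i * 2)).saveAsObjectFile(path)
    // "Job 2": rebuild an RDD from the files
    sc.objectFile[(Int, Int)](path).count()
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch can run without a cluster
    val conf = new SparkConf().setAppName("DiskHandoff").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // saveAsObjectFile expects a target directory that does not exist yet
    val path = Files.createTempDirectory("rdd-handoff").resolve("pairs").toString
    println("Restored pairs: " + roundTrip(sc, path))
    sc.stop()
  }
}
```

Every hand-off of this kind pays for serialization and file I/O twice, which is the overhead a shared in-memory RDD is meant to avoid.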
Ignite for Spark
• Spark RDD abstraction
• Shared in-memory view on data across different Spark jobs, workers or applications
• Implemented as a view over a distributed Ignite cache
IgniteContext
• Main entry-point to Spark-Ignite integration
• SparkContext plus either one of
– IgniteConfiguration()
– Path to XML configuration file
• Optional Boolean client argument
– true => Shared deployment
– false => Embedded deployment
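A sketch of how the optional Boolean argument might be supplied, assuming the `IgniteContext` constructors accept it as a third positional argument after the configuration closure or XML path. The object name `IgniteContextModes` and the `embedded` command-line switch are hypothetical; the example expects to be launched through spark-submit against an environment where Ignite is available.

```scala
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

object IgniteContextModes {
  // Shared deployment: the Boolean argument is true
  def shared(sc: SparkContext): IgniteContext =
    new IgniteContext(sc, "examples/config/spark/example-shared-rdd.xml", true)

  // Embedded deployment: the Boolean argument is false
  def embedded(sc: SparkContext): IgniteContext =
    new IgniteContext(sc, () => new IgniteConfiguration(), false)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IgniteContextModes"))
    // Hypothetical switch: pass "embedded" as the first argument to pick embedded mode
    val ic = if (args.headOption.contains("embedded")) embedded(sc) else shared(sc)
    println("IgniteContext created: " + ic)
    sc.stop()
  }
}
```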
IgniteContext examples
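// Option 1: pass a closure that builds an IgniteConfiguration
val igniteContext = new IgniteContext(sparkContext,
    () => new IgniteConfiguration())

// Option 2: pass the path to an XML configuration file
val igniteContext = new IgniteContext(sparkContext,
    "examples/config/spark/example-shared-rdd.xml")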
IgniteRDD
• Implementation of Spark RDD representing a live view of an Ignite cache
• Mutable (unlike native RDDs)
– All changes in Ignite cache will be visible to RDD users immediately
• Provides partitioning information to Spark executor
• Provides affinity information to Spark so that RDD computations can use data locality
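The difference between a live view and an immutable snapshot can be illustrated in plain Scala. This sketch does not use the Ignite API: `LiveViewSketch` is a hypothetical object, and a `TrieMap` merely stands in for the shared cache.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical stand-in for a shared cache; not the Ignite API.
object LiveViewSketch {
  val cache: TrieMap[Int, Int] = TrieMap.empty[Int, Int]

  // Snapshot: contents are copied once, so later cache updates are not reflected
  // (comparable to an immutable native RDD).
  def snapshot(): Map[Int, Int] = cache.toMap

  // Live view: every call re-reads the cache, so updates are visible immediately.
  def liveCount(p: Int => Boolean): Int = cache.values.count(p)
}
```

A snapshot taken before an update keeps its old size, whereas `liveCount` called after the update already includes the new entry.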
Write to Ignite
• Ignite caches operate on key-value pairs
• Spark tuple RDD for key-value pairs and savePairs method
– RDD partitioning, store values in parallel if possible
• Value-only RDD and saveValues method
– IgniteRDD generates a unique affinity-local key for each value stored into the cache
Write code example
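import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("SparkIgniteWriter")
val sc = new SparkContext(conf)
// Create the Ignite context from the shared-RDD example configuration
val ic = new IgniteContext(sc,
    "examples/config/spark/example-shared-rdd.xml")
val sharedRDD: IgniteRDD[Int, Int] = ic.fromCache("sharedRDD")
// Store the pairs (1,1) to (100000,100000), built from a 10-partition RDD
sharedRDD.savePairs(sc.parallelize(1 to 100000, 10)
    .map(i => (i, i)))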
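The example above uses savePairs. A value-only variant with saveValues might look like the following sketch. The cache name `valuesRDD` and the object name are hypothetical, and the sketch assumes that cache exists or can be created by Ignite. Because IgniteRDD generates the keys itself, the key type is left general (`AnyRef`) here rather than `Int`.

```scala
import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
import org.apache.spark.{SparkConf, SparkContext}

object SparkIgniteValueWriter {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkIgniteValueWriter")
    val sc = new SparkContext(conf)
    val ic = new IgniteContext(sc,
      "examples/config/spark/example-shared-rdd.xml")

    // "valuesRDD" is a hypothetical cache name; keys are generated by IgniteRDD,
    // so only the value type (Int) is meaningful to this application.
    val valuesRDD: IgniteRDD[AnyRef, Int] = ic.fromCache("valuesRDD")

    // Store 1000 values without supplying keys
    valuesRDD.saveValues(sc.parallelize(1 to 1000, 10))

    println("Entries in cache: " + valuesRDD.count())
    sc.stop()
  }
}
```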
Read from Ignite
• IgniteRDD is a live view of an Ignite cache
– No need to explicitly load data to Spark application from Ignite
– All RDD methods are available to use right away after an instance of IgniteRDD is created
Read code example
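import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("SparkIgniteReader")
val sc = new SparkContext(conf)
val ic = new IgniteContext(sc,
    "examples/config/spark/example-shared-rdd.xml")
// Live view of the same cache that the writer application populated
val sharedRDD: IgniteRDD[Int, Int] = ic.fromCache("sharedRDD")
// Keep only the pairs whose value is greater than 50000
val greaterThanFiftyThousand = sharedRDD.filter(_._2 > 50000)
println("The count is " + greaterThanFiftyThousand.count())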
Demos
Any Questions?
Thank you for joining us. Follow the conversation.
http://ignite.apache.org
