INTRODUCTION TO APACHE SPARK
JUGBD MEETUP #5.0
MAY 23, 2015
MUKTADIUR RAHMAN
TEAM LEAD, M&H INFORMATICS(BD) LTD.
OVERVIEW
• Apache Spark is a cluster computing framework that provides:
• a fast and general engine for large-scale data processing
• programs that run up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk
• a simple API in Scala, Java, and Python
• This talk will cover:
• Components of the Spark Stack
• Resilient Distributed Datasets (RDDs)
• Programming with Spark
A BRIEF HISTORY OF SPARK
• Spark was started by Matei Zaharia in 2009 as a research project
in the UC Berkeley RAD Lab, which later became the AMPLab.
• Spark was first open sourced in March 2010 and transferred
to the Apache Software Foundation in June 2013
• Spark had over 465 contributors in 2014, making it the most
active project in the Apache Software Foundation and
among Big Data open source projects
• Spark 1.3.1, released on April 17, 2015
(http://www.apache.org/dyn/closer.cgi/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz)
SPARK STACK
Resilient Distributed Datasets (RDD)
An RDD in Spark is simply an immutable distributed collection
of objects. Each RDD is split into multiple partitions, which
may be computed on different nodes of the cluster.
RDDs can be created in two ways:
• by loading an external dataset
• scala> val reads = sc.textFile("README.md")
• by distributing a collection of objects
• scala> val data = sc.parallelize(1 to 100000)
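The two creation paths above can also be written as a small standalone program instead of shell commands. This is a minimal sketch, assuming Spark core is on the classpath and a local master (`local[2]`) is acceptable for a demo; the object name `CreateRdds`, the helper `numbers`, and the choice of 4 partitions are hypothetical and not from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CreateRdds {
  // Distribute an in-memory collection, explicitly asking for 4 partitions
  // (the partition count is an assumption made for this illustration).
  def numbers(sc: SparkContext): RDD[Int] =
    sc.parallelize(1 to 100000, numSlices = 4)

  def main(args: Array[String]): Unit = {
    // "local[2]" runs Spark inside this JVM with two worker threads.
    val conf = new SparkConf().setAppName("create-rdds").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val data = numbers(sc)
      println(s"partitions: ${data.partitions.length}")
      // Convert to Long before summing: the total does not fit in an Int.
      println(s"sum: ${data.map(_.toLong).reduce(_ + _)}")

      // Load an external dataset only if a path is passed on the command line,
      // so the sketch does not depend on a particular file being present.
      args.headOption.foreach { path =>
        val lines = sc.textFile(path)
        println(s"lines in $path: ${lines.count()}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

In the Spark shell, `sc` already exists, so only the `parallelize` and `textFile` calls are needed there.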
RDD
Once created, RDDs offer two types of operations:
• transformations
• actions
Example:
Step 1: Create an RDD
scala> val data = sc.textFile("README.md")
Step 2: Transformation
scala> val lines = data.filter(line => line.contains("Spark"))
Step 3: Action
scala> lines.count()
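The difference between the two kinds of operation is that a transformation only describes a new RDD, while an action actually runs a job and returns a value. The sketch below illustrates this under a few assumptions: a local master, a three-element in-memory dataset in place of README.md, and the hypothetical names `LazyTransformations` and `sparkLines`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LazyTransformations {
  // Transformation only: returns a description of a new RDD, computes nothing yet.
  def sparkLines(lines: RDD[String]): RDD[String] =
    lines.filter(line => line.contains("Spark"))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazy-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in data so the example does not depend on a file.
      val data = sc.parallelize(Seq("Apache Spark", "Hadoop MapReduce", "Spark SQL"))

      val lines = sparkLines(data)   // no job has run at this point
      println(lines.toDebugString)   // shows the lineage Spark will execute later

      println(lines.count())         // action: runs the job, prints 2
      println(lines.first())         // another action: prints "Apache Spark"
    } finally {
      sc.stop()
    }
  }
}
```

Because nothing is cached here, each action recomputes the filter from the source data.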
RDD
Persisting an RDD in memory
Example:
Step 1: Create an RDD
scala> val data = sc.textFile("README.md")
Step 2: Transformation
scala> val lines = data.filter(line => line.contains("Spark"))
Step 3: Persist in memory
scala> lines.cache() // or lines.persist()
Step 4: Action (the first action computes and caches the RDD)
scala> lines.count()
Step 5: Unpersist once the cached data is no longer needed
scala> lines.unpersist()
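Persisting pays off when more than one action reuses the same RDD. The following is a minimal sketch of that pattern, assuming a local master and a small numeric dataset; the names `PersistDemo` and `squares` are hypothetical. It uses `persist(StorageLevel.MEMORY_ONLY)`, which is the storage level `cache()` uses for RDDs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistDemo {
  // Squares of 1..1000 as Long values (illustrative data, not from the talk).
  def squares(sc: SparkContext): RDD[Long] =
    sc.parallelize(1 to 1000).map(n => n.toLong * n)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("persist-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = squares(sc)
      // Mark the RDD for in-memory storage; nothing is cached until an action runs.
      rdd.persist(StorageLevel.MEMORY_ONLY)

      val total = rdd.reduce(_ + _)               // first action: computes and caches
      val evens = rdd.filter(_ % 2 == 0).count()  // second action: can reuse cached partitions
      println(s"total = $total, even squares = $evens")
      println(s"stored in memory: ${rdd.getStorageLevel.useMemory}")

      // Release the cached partitions once they are no longer needed.
      rdd.unpersist()
    } finally {
      sc.stop()
    }
  }
}
```

Calling `unpersist()` before any action would discard the cache request before it had any effect, which is why it comes last here.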
SPARK EXAMPLE : WORD COUNT
Scala>>
val data = sc.textFile("README.md")
val counts = data.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("/tmp/output")
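The word count above writes its result to disk. For a quick check it can be easier to bring a few results back to the driver. The sketch below is a variation under stated assumptions, not the talk's code: it splits on runs of whitespace, drops empty tokens, and uses the `top` action to fetch the most frequent words. The names `WordCountSketch` and `wordCounts` and the sample sentences are hypothetical, and it assumes Spark 1.3 or later, where pair-RDD operations such as `reduceByKey` need no extra import.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object WordCountSketch {
  // Split each line on runs of whitespace, drop empty tokens, then count per word.
  def wordCounts(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(line => line.split("\\s+"))
      .filter(word => word.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("word-count-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in input so the example does not depend on a file.
      val lines = sc.parallelize(Seq("to be or not to be", "to  do"))
      val counts = wordCounts(lines)

      // top(n) is an action: it returns the n largest elements to the driver,
      // here ordered by the count in each (word, count) pair.
      val mostFrequent = counts.top(2)(Ordering.by[(String, Int), Int](_._2))
      mostFrequent.foreach(println)   // prints (to,3) then (be,2)
    } finally {
      sc.stop()
    }
  }
}
```

For large outputs, `saveAsTextFile` remains the better choice, since `top` and `collect` pull data into the driver's memory.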
SPARK EXAMPLE : WORD COUNT
Java 8>>
JavaRDD<String> data = sc.textFile("README.md");
// Spark 1.x: flatMap returns an Iterable (Spark 2.x and later expect an Iterator)
JavaRDD<String> words =
    data.flatMap(line -> Arrays.asList(line.split(" ")));
JavaPairRDD<String, Integer> counts =
    words.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
         .reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("/tmp/output");
RESOURCES
• https://spark.apache.org/docs/latest/
• http://shop.oreilly.com/product/0636920028512.do
• https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x
• https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x
• https://www.facebook.com/groups/898580040204667/
Q/A
Thank YOU!


