Apache Spark Basics


Starting Spark
Change to the Spark installation directory.
cd $SPARK_HOME
Start spark-shell (the Scala shell) with the command below.
./bin/spark-shell
Start pyspark (the Python shell) with the command below.
./bin/pyspark
Start SparkR (the R shell) with the command below.
./bin/sparkR

Spark Application details
Driver program: The program that runs the user’s main function and executes parallel
operations on a cluster.
SparkConf: Object that holds the configuration of your application, such as its name and master URL.
SparkContext: Object used to connect to and access the cluster.
Resilient distributed dataset (RDD): Collection of elements partitioned across the nodes of the
cluster that can be operated on in parallel.

Operations on RDD
Transformations: Return another RDD (for example map, filter, flatMap).
Actions: Return a value to the driver program (for example collect, count, first).
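
To make the distinction concrete, here is a minimal plain-Python sketch that runs without Spark. MiniRDD is a hypothetical stand-in invented for this illustration, not a Spark class; it only mirrors the shape of the API: map and filter hand back another MiniRDD, while collect and count hand back an ordinary value.

```python
# MiniRDD is a hypothetical local stand-in for illustration only; it is not part of Spark.
class MiniRDD:
    def __init__(self, items):
        self._items = list(items)

    # Transformations: each one returns another MiniRDD.
    def map(self, f):
        return MiniRDD(f(x) for x in self._items)

    def filter(self, pred):
        return MiniRDD(x for x in self._items if pred(x))

    # Actions: each one returns a plain value.
    def collect(self):
        return list(self._items)

    def count(self):
        return len(self._items)


rdd = MiniRDD([1, 2, 3, 4, 5])
squares = rdd.map(lambda x: x * x)             # transformation -> MiniRDD
evens = squares.filter(lambda x: x % 2 == 0)   # transformation -> MiniRDD
result = evens.collect()                       # action -> list
total = squares.count()                        # action -> int
```

Unlike this eager sketch, Spark does not compute a transformation until an action is called on the resulting RDD.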

Create a file spark_notes.txt with the contents below
Apache Spark is an open-source Big Data analytical framework.
RDD is the main abstraction in Apache Spark.
Apache Spark can also be called a unified engine.
Scala is a functional programming language.
Apache Spark is developed using the Scala programming language.
Let's start learning Apache Spark and become a Data Scientist in the Big Data space.

RDD creation (Scala)
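1) From an in-memory collection
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
val multiply = rdd.map(x => x * x)  // transformation: square each element
multiply.collect()                  // action: Array(1, 4, 9, 16, 25)
2) From a text file
val textRdd = sc.textFile("/home/ubuntu/work/spark_notes.txt")
textRdd.first()                     // action: returns the first line of the file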

RDD creation (Python)
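1) From an in-memory collection
rdd = sc.parallelize([1, 2, 3, 4, 5])
multiply = rdd.map(lambda x: x * x)  # transformation: square each element
multiply.collect()                   # action: [1, 4, 9, 16, 25]
2) From a text file
textRdd = sc.textFile("/home/ubuntu/work/spark_notes.txt")
textRdd.first()                      # action: returns the first line of the file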

Examples
val lines = sc.textFile("/home/ubuntu/work/spark_notes.txt")
lines.count() // Count the number of items (lines) in this RDD
val sparkLines = lines.filter(line => line.contains("Spark")) // Keep only lines that mention "Spark"
sparkLines.count()
val scalaLines = lines.filter(line => line.contains("Scala")) // Keep only lines that mention "Scala"
scalaLines.count()

Word Count Example
val lines = sc.textFile("/home/ubuntu/work/spark_notes.txt")
val flatMapWords = lines.flatMap(line => line.split(" "))    // split every line into words
flatMapWords.collect()
val wordwithOneNumber = flatMapWords.map(word => (word, 1))  // pair each word with the count 1
val count = wordwithOneNumber.reduceByKey((x, y) => x + y)   // sum the counts for each word
count.collect()
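
The three steps of the word count (split into words, pair each word with 1, add up the pairs per word) can be traced in plain Python without a cluster. This is only a local sketch of the data flow, not Spark code; the sample lines and all variable names are assumptions made for the illustration.

```python
from collections import defaultdict

# Assumed sample input standing in for the lines of a text file.
lines = ["Apache Spark", "Spark core", "Spark ml"]

# Step 1 (flatMap): split each line and flatten into one list of words.
words = [word for line in lines for word in line.split(" ")]

# Step 2 (map): pair every word with the count 1.
pairs = [(word, 1) for word in words]

# Step 3 (reduceByKey): add up the counts that share the same word.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

word_counts = dict(counts)
```

In Spark, step 3 combines values key by key across partitions, so the order of the collected (word, count) pairs is not guaranteed.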

flatMap() and map()
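val lines = sc.parallelize(List("hello world", "hello spark"))
val wordsFlatMap = lines.flatMap(line => line.split(" "))
wordsFlatMap.collect()  // Array(hello, world, hello, spark): one flat collection of words
val wordsMap = lines.map(line => line.split(" "))
wordsMap.collect()      // Array(Array(hello, world), Array(hello, spark)): one array per line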
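
The difference in result shape can also be reproduced in plain Python, which may help when no Spark shell is at hand. This is a local analogy under the assumption that a list stands in for the RDD; the variable names are made up for the sketch.

```python
from itertools import chain

lines = ["hello world", "hello spark"]

# map: exactly one output element per input line, so each line becomes a list of words.
words_map = [line.split(" ") for line in lines]

# flatMap: split each line, then flatten the per-line lists into a single list.
words_flat_map = list(chain.from_iterable(line.split(" ") for line in lines))
```

Here words_map has two elements (one list per line), while words_flat_map has four (one per word).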

Custom Method
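def sp(n: String): Array[String] = { n.split(" ") }  // split a string on spaces
val rdd = sc.parallelize(List("Apache spark", "spark core", "spark ml"))
val words = rdd.flatMap(sp)      // one flat RDD of words
words.collect()
val wordArrays = rdd.map(sp)     // one array of words per input string
wordArrays.collect()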

Transformations & Actions

Assignments
Take the list 1, 2, 3, 4, 5, 1, 2, 3, 1.
Write code for the problems below.
1) Add each element of the list to itself.
2) Add one to each element of the list.
3) Filter out 1 from the list.
4) Find the top 10 most frequent words in a file.
5) Keep only the words longer than 4 characters from a file.
