Quick reference for Spark SQL
********************************************************************************
DataFrames - DataFrame Operations
********************************************************************************
SQLContext
==========
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
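Note: in spark-shell, sc is already provided. On Spark 2.x and later the entry
point is SparkSession, which subsumes SQLContext - a minimal equivalent sketch
(assumes Spark 2.x; the app name "spark-sql-quickref" is arbitrary):

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("spark-sql-quickref").getOrCreate()
val sqlContext = spark.sqlContext   // exposes the same API used below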
--------------------------------------------------------------------------------
Ex:employee.json (JSON Lines format - one object per line, no enclosing braces)
{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"}
--------------------------------------------------------------------------------
Copy the File to HDFS
=====================
hdfs dfs -put employee.json .
(note: 'hadoop dfs' is deprecated in favor of 'hdfs dfs'; '.' resolves to your
HDFS home directory)
--------------------------------------------------------------------------------
Read the JSON Document
======================
val dfs = sqlContext.read.json("employee.json")
--------------------------------------------------------------------------------
Show the Data
=============
dfs.show()
--------------------------------------------------------------------------------
Use printSchema Method
======================
dfs.printSchema()
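Every value in employee.json is quoted, so all three fields are inferred as
strings; the printed schema should look roughly like:

root
 |-- age: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)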
--------------------------------------------------------------------------------
Use Select Method
=================
dfs.select("name").show()
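Equivalent forms for selecting and deriving columns (sketch; selectExpr takes
SQL expressions):

dfs.select(dfs("name"), dfs("age")).show()
dfs.selectExpr("name", "cast(age as int) as age").show()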
--------------------------------------------------------------------------------
Use Age Filter
==============
dfs.filter(dfs("age") > 23).show()
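"age" was inferred as a string, so the comparison above relies on an implicit
cast; an explicit cast is safer (sketch):

dfs.filter(dfs("age").cast("int") > 23).show()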
--------------------------------------------------------------------------------
Use groupBy Method
==================
dfs.groupBy("age").count().show()
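For aggregates beyond count, use agg - a sketch (the alias "num_employees" is
illustrative; requires the functions import):

import org.apache.spark.sql.functions._
dfs.groupBy("age").agg(count("id").as("num_employees")).show()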
********************************************************************************
DataFrames - Inferring the Schema using Reflection
********************************************************************************
Ex:employee.txt
1201, satish, 25
1202, krishna, 28
1203, amith, 39
1204, javed, 23
1205, prudvi, 23
--------------------------------------------------------------------------------
Create SQLContext
=================
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
--------------------------------------------------------------------------------
Import SQL Functions
====================
import sqlContext.implicits._
--------------------------------------------------------------------------------
Create Case Class
=================
case class Employee(id: Int, name: String, age: Int)
--------------------------------------------------------------------------------
Create RDD and Apply Transformations
====================================
val empl = sc.textFile("employee.txt")
  .map(_.split(","))
  .map(e => Employee(e(0).trim.toInt, e(1).trim, e(2).trim.toInt))
  .toDF()
(note: name is also trimmed, since the sample records have a space after each comma)
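Quick check that reflection picked up the case-class types (sketch):

empl.printSchema()   // expect: id integer, name string, age integer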
--------------------------------------------------------------------------------
Store the DataFrame Data in a Table
===================================
empl.registerTempTable("employee")
--------------------------------------------------------------------------------
Select Query on DataFrame
=========================
val allrecords = sqlContext.sql("SELECT * FROM employee")
verification:
allrecords.show()
--------------------------------------------------------------------------------
Where Clause SQL Query on DataFrame
===================================
val agefilter = sqlContext.sql("SELECT * FROM employee WHERE age >= 20 AND age <= 35")
verification:
agefilter.show()
--------------------------------------------------------------------------------
Fetch ID Values from agefilter DataFrame Using Column Index
===========================================================
agefilter.map(t => "ID: " + t(0)).collect().foreach(println)
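Positional access returns Any; named, typed access reads better - a sketch
(getAs by field name assumes Spark 1.4+):

agefilter.map(t => "ID: " + t.getAs[Int]("id")).collect().foreach(println)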
********************************************************************************
DataFrames - Programmatically Specifying the Schema
********************************************************************************
Ex:employee.txt
1201, satish, 25
1202, krishna, 28
1203, amith, 39
1204, javed, 23
1205, prudvi, 23
--------------------------------------------------------------------------------
Create SQLContext Object
========================
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
--------------------------------------------------------------------------------
Read Input from Text File
=========================
val employee = sc.textFile("employee.txt")
--------------------------------------------------------------------------------
Create an Encoded Schema in a String Format
===========================================
val schemaString = "id name age"
--------------------------------------------------------------------------------
Import the Required Classes
===========================
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
--------------------------------------------------------------------------------
Generate Schema
===============
val schema = StructType(schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true)))
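The same schema written out explicitly, for comparison (sketch):

val schemaExplicit = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("age", StringType, nullable = true)))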
--------------------------------------------------------------------------------
Convert the Text-File Records into Rows
=======================================
val rowRDD = employee.map(_.split(","))
  .map(e => Row(e(0).trim, e(1).trim, e(2).trim))
(note: every field stays a String here to match the all-StringType schema above;
the original .toInt conversions would clash with StringType at runtime)
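If you want numeric columns instead, declare them in the schema and keep the
toInt conversions - a sketch (IntegerType also lives in
org.apache.spark.sql.types):

import org.apache.spark.sql.types.IntegerType
val schemaTyped = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))
val rowRDDTyped = employee.map(_.split(","))
  .map(e => Row(e(0).trim.toInt, e(1).trim, e(2).trim.toInt))
// then: sqlContext.createDataFrame(rowRDDTyped, schemaTyped)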
--------------------------------------------------------------------------------
Apply the Schema to the RDD of Rows
===================================
val employeeDF = sqlContext.createDataFrame(rowRDD, schema)
--------------------------------------------------------------------------------
Store DataFrame Data into Table
===============================
employeeDF.registerTempTable("employee")
--------------------------------------------------------------------------------
Select Query on DataFrame
=========================
val allrecords = sqlContext.sql("SELECT * FROM employee")
verification:
allrecords.show()
********************************************************************************
Data Sources - JSON Datasets
********************************************************************************
Ex:employee.json (JSON Lines format - one object per line, no enclosing braces)
{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"}
--------------------------------------------------------------------------------
Read JSON Document
==================
val dfs = sqlContext.read.json("employee.json")
--------------------------------------------------------------------------------
Use printSchema Method
======================
dfs.printSchema()
--------------------------------------------------------------------------------
Show the Data
=============
dfs.show()
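JSON can also be parsed straight from an RDD of strings (Spark 1.x API; the
extra record below is made up for illustration):

val jsonRDD = sc.parallelize("""{"id" : "1206", "name" : "ravi", "age" : "31"}""" :: Nil)
val extraDF = sqlContext.read.json(jsonRDD)
extraDF.show()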
********************************************************************************
Data Sources - Hive Tables
(requires Spark built with Hive support; this section failed here because the
Hive dependencies were missing)
********************************************************************************
Ex:employee.txt
1201, satish, 25
1202, krishna, 28
1203, amith, 39
1204, javed, 23
1205, prudvi, 23
--------------------------------------------------------------------------------
Create HiveContext Object
=========================
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
--------------------------------------------------------------------------------
Create Table using HiveQL
=========================
sqlContext.sql("CREATE TABLE IF NOT EXISTS employee(id INT, name STRING, age
INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY 'n'")
--------------------------------------------------------------------------------
Load Data into Table using HiveQL
=================================
sqlContext.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO TABLE employee")
--------------------------------------------------------------------------------
Select Fields from the Table
============================
val result = sqlContext.sql("FROM employee SELECT id, name, age")
verification:
result.show()
********************************************************************************
Data Sources - Parquet Files
********************************************************************************
Create SQLContext Object
========================
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
--------------------------------------------------------------------------------
Read JSON Document
==================
val employee = sqlContext.read.json("employee.json")
--------------------------------------------------------------------------------
Convert JSON to Parquet
=======================
employee.write.parquet("employee.parquet")
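Re-running the write fails if the target path already exists; SaveMode controls
that behavior (sketch):

import org.apache.spark.sql.SaveMode
employee.write.mode(SaveMode.Overwrite).parquet("employee.parquet")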
________________________________________________________________________________
Create SQLContext Object
========================
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
--------------------------------------------------------------------------------
Read Input from Parquet File
============================
val parqfile = sqlContext.read.parquet("employee.parquet")
--------------------------------------------------------------------------------
Store the DataFrame Data in a Table
===================================
parqfile.registerTempTable("employee")
--------------------------------------------------------------------------------
Select Query on DataFrame
=========================
val allrecords = sqlContext.sql("SELECT * FROM employee")
verification:
allrecords.show()