Quick reference for Spark SQL
********************************************************************************
DataFrames - DataFrame Operations
********************************************************************************
SQLContext
==========
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
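Note: in spark-shell, sc is already provided. On Spark 2.x and later the entry
point is SparkSession, which subsumes SQLContext - a minimal equivalent sketch
(assumes Spark 2.x; the app name "spark-sql-quickref" is arbitrary):

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("spark-sql-quickref").getOrCreate()
val sqlContext = spark.sqlContext   // exposes the same API used below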
--------------------------------------------------------------------------------
Ex:employee.json (JSON Lines format - one object per line, no enclosing braces)
{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"}
--------------------------------------------------------------------------------
Copy the File to HDFS
=====================
hdfs dfs -put employee.json .
(note: 'hadoop dfs' is deprecated in favor of 'hdfs dfs'; '.' resolves to your
HDFS home directory)
--------------------------------------------------------------------------------
Read the JSON Document
======================
val dfs = sqlContext.read.json("employee.json")
--------------------------------------------------------------------------------
Show the Data
=============
dfs.show()
--------------------------------------------------------------------------------
Use printSchema Method
======================
dfs.printSchema()
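Every value in employee.json is quoted, so all three fields are inferred as
strings; the printed schema should look roughly like:

root
 |-- age: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)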
--------------------------------------------------------------------------------
Use Select Method
=================
dfs.select("name").show()
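Equivalent forms for selecting and deriving columns (sketch; selectExpr takes
SQL expressions):

dfs.select(dfs("name"), dfs("age")).show()
dfs.selectExpr("name", "cast(age as int) as age").show()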
--------------------------------------------------------------------------------
Use Age Filter
==============
dfs.filter(dfs("age") > 23).show()
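"age" was inferred as a string, so the comparison above relies on an implicit
cast; an explicit cast is safer (sketch):

dfs.filter(dfs("age").cast("int") > 23).show()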
--------------------------------------------------------------------------------
Use groupBy Method
==================
dfs.groupBy("age").count().show()
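For aggregates beyond count, use agg - a sketch (the alias "num_employees" is
illustrative; requires the functions import):

import org.apache.spark.sql.functions._
dfs.groupBy("age").agg(count("id").as("num_employees")).show()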
********************************************************************************
DataFrames - Inferring the Schema using Reflection
********************************************************************************
Ex:employee.txt
1201, satish, 25
1202, krishna, 28
1203, amith, 39
1204, javed, 23
1205, prudvi, 23
--------------------------------------------------------------------------------
Create SQLContext
=================
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
--------------------------------------------------------------------------------
Import SQL Functions
====================
import sqlContext.implicits._
--------------------------------------------------------------------------------
Create Case Class
=================
case class Employee(id: Int, name: String, age: Int)
--------------------------------------------------------------------------------
Create RDD and Apply Transformations
====================================
val empl = sc.textFile("employee.txt")
  .map(_.split(","))
  .map(e => Employee(e(0).trim.toInt, e(1).trim, e(2).trim.toInt))
  .toDF()
(note: name is also trimmed, since the sample records have a space after each comma)
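Quick check that reflection picked up the case-class types (sketch):

empl.printSchema()   // expect: id integer, name string, age integer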
--------------------------------------------------------------------------------
Store the DataFrame Data in a Table
===================================
empl.registerTempTable("employee")
--------------------------------------------------------------------------------
Select Query on DataFrame
=========================
val allrecords = sqlContext.sql("SELECT * FROM employee")
verification:
allrecords.show()
--------------------------------------------------------------------------------
Where Clause SQL Query on DataFrame
===================================
val agefilter = sqlContext.sql("SELECT * FROM employee WHERE age >= 20 AND age <= 35")
verification:
agefilter.show()
--------------------------------------------------------------------------------
Fetch ID Values from agefilter DataFrame Using Column Index
===========================================================
agefilter.map(t => "ID: " + t(0)).collect().foreach(println)
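Positional access returns Any; named, typed access reads better - a sketch
(getAs by field name assumes Spark 1.4+):

agefilter.map(t => "ID: " + t.getAs[Int]("id")).collect().foreach(println)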
********************************************************************************
DataFrames - Programmatically Specifying the Schema
********************************************************************************
Ex:employee.txt
1201, satish, 25
1202, krishna, 28
1203, amith, 39
1204, javed, 23
1205, prudvi, 23
--------------------------------------------------------------------------------
Create SQLContext Object
========================
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
--------------------------------------------------------------------------------
Read Input from Text File
=========================
val employee = sc.textFile("employee.txt")
--------------------------------------------------------------------------------
Create an Encoded Schema in a String Format
===========================================
val schemaString = "id name age"
--------------------------------------------------------------------------------
Import the Required Classes
===========================
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
--------------------------------------------------------------------------------
Generate Schema
===============
val schema = StructType(schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true)))
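The same schema written out explicitly, for comparison (sketch):

val schemaExplicit = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("age", StringType, nullable = true)))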
--------------------------------------------------------------------------------
Convert the Text-File Records into Rows
=======================================
val rowRDD = employee.map(_.split(","))
  .map(e => Row(e(0).trim, e(1).trim, e(2).trim))
(note: every field stays a String here to match the all-StringType schema above;
the original .toInt conversions would clash with StringType at runtime)
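If you want numeric columns instead, declare them in the schema and keep the
toInt conversions - a sketch (IntegerType also lives in
org.apache.spark.sql.types):

import org.apache.spark.sql.types.IntegerType
val schemaTyped = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))
val rowRDDTyped = employee.map(_.split(","))
  .map(e => Row(e(0).trim.toInt, e(1).trim, e(2).trim.toInt))
// then: sqlContext.createDataFrame(rowRDDTyped, schemaTyped)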
--------------------------------------------------------------------------------
Apply the Schema to the RDD of Rows
===================================
val employeeDF = sqlContext.createDataFrame(rowRDD, schema)
--------------------------------------------------------------------------------
Store DataFrame Data into Table
===============================
employeeDF.registerTempTable("employee")
--------------------------------------------------------------------------------
Select Query on DataFrame
=========================
val allrecords = sqlContext.sql("SELECT * FROM employee")
verification:
allrecords.show()
********************************************************************************
Data Sources - JSON Datasets
********************************************************************************
Ex:employee.json (JSON Lines format - one object per line, no enclosing braces)
{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"}
--------------------------------------------------------------------------------
Read JSON Document
==================
val dfs = sqlContext.read.json("employee.json")
--------------------------------------------------------------------------------
Use printSchema Method
======================
dfs.printSchema()
--------------------------------------------------------------------------------
Show the Data
=============
dfs.show()
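JSON can also be parsed straight from an RDD of strings (Spark 1.x API; the
extra record below is made up for illustration):

val jsonRDD = sc.parallelize("""{"id" : "1206", "name" : "ravi", "age" : "31"}""" :: Nil)
val extraDF = sqlContext.read.json(jsonRDD)
extraDF.show()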
********************************************************************************
Data Sources - Hive Tables
(requires Spark built with Hive support; this section failed here because the
Hive dependencies were missing)
********************************************************************************
Ex:employee.txt
1201, satish, 25
1202, krishna, 28
1203, amith, 39
1204, javed, 23
1205, prudvi, 23
--------------------------------------------------------------------------------
Create HiveContext Object
=========================
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
--------------------------------------------------------------------------------
Create Table using HiveQL
=========================
sqlContext.sql("CREATE TABLE IF NOT EXISTS employee(id INT, name STRING, age
INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY 'n'")
--------------------------------------------------------------------------------
Load Data into Table using HiveQL
=================================
sqlContext.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO TABLE employee")
--------------------------------------------------------------------------------
Select Fields from the Table
============================
val result = sqlContext.sql("FROM employee SELECT id, name, age")
verification:
result.show()
********************************************************************************
Data Sources - Parquet Files
********************************************************************************
Create SQLContext Object
========================
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
--------------------------------------------------------------------------------
Read JSON Document
==================
val employee = sqlContext.read.json("employee.json")
--------------------------------------------------------------------------------
Convert JSON to Parquet
=======================
employee.write.parquet("employee.parquet")
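Re-running the write fails if the target path already exists; SaveMode controls
that behavior (sketch):

import org.apache.spark.sql.SaveMode
employee.write.mode(SaveMode.Overwrite).parquet("employee.parquet")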
________________________________________________________________________________
Create SQLContext Object
========================
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
--------------------------------------------------------------------------------
Read Input from Parquet File
============================
val parqfile = sqlContext.read.parquet("employee.parquet")
--------------------------------------------------------------------------------
Store the DataFrame Data in a Table
===================================
parqfile.registerTempTable("employee")
--------------------------------------------------------------------------------
Select Query on DataFrame
=========================
val allrecords = sqlContext.sql("SELECT * FROM employee")
verification:
allrecords.show()