Learning spark ch09 - Spark SQL

CHAPTER 09: SPARK SQL
Learning Spark
by Holden Karau et al.

Overview: SPARK SQL
 Linking with Spark SQL
 Using Spark SQL in Applications
 Initializing Spark SQL
 Basic Query Example
 SchemaRDDs
 Caching
 Loading and Saving Data
 Apache Hive
 Parquet
 JSON
 From RDDs
 JDBC/ODBC Server
 Working with Beeline
 Long-Lived Tables and Queries
 User-Defined Functions
 Spark SQL UDFs
 Hive UDFs
 Spark SQL Performance
 Performance Tuning Options

9.1. Linking with Spark SQL
 Spark SQL can be built with or without Apache Hive, the Hadoop
SQL engine. Spark SQL with Hive support allows us to access Hive
tables, UDFs (user-defined functions), SerDes (serialization and
deserialization formats), and the Hive query language (HiveQL, also
abbreviated HQL).
 It is important to note that including the Hive libraries does not
require an existing Hive installation. In general, it is best to build
Spark SQL with Hive support to access these features. If you download
Spark in binary form, it should already be built with Hive support.

9.2. Using Spark SQL in Applications
 The most powerful way to use Spark SQL is inside a Spark
application. This gives us the power to easily load data and query it
with SQL while simultaneously combining it with “regular” program
code in Python, Java, or Scala.
 To use Spark SQL this way, we construct a HiveContext (or
SQLContext for those wanting a stripped-down version) based on
our SparkContext. This context provides additional functions for
querying and interacting with Spark SQL data. Using the
HiveContext, we can build SchemaRDDs, which represent our
structured data, and operate on them with SQL or with normal RDD
operations like map().

edX and Coursera Courses
 Introduction to Big Data with Apache Spark
 Spark Fundamentals I
 Functional Programming Principles in Scala

9.2.1 Initializing Spark SQL
 Scala SQL imports
 // Import Spark SQL
 import org.apache.spark.sql.hive.HiveContext
 // Or if you can't have the hive dependencies
 import org.apache.spark.sql.SQLContext
 Scala SQL implicits
 // Create a Spark SQL HiveContext
 val hiveCtx = ...
 // Import the implicit conversions
 import hiveCtx._
 Constructing a SQL context in Scala
 val sc = new SparkContext(...)
 val hiveCtx = new HiveContext(sc)

9.2.2 Basic Query Example
 To make a query against a table, we call the sql() method on
the HiveContext or SQLContext. The first thing we need to
do is tell Spark SQL about some data to query. In this case
we will load some Twitter data from JSON, and give it a
name by registering it as a “temporary table” so we can
query it with SQL.
 Loading and querying tweets in Scala
 val input = hiveCtx.jsonFile(inputFile)
 // Register the input schema RDD
 input.registerTempTable("tweets")
 // Select the most retweeted tweets (DESC puts the highest retweetCount first)
 val topTweets = hiveCtx.sql(
   "SELECT text, retweetCount FROM tweets ORDER BY retweetCount DESC LIMIT 10")
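The register-then-query pattern can be tried without a cluster. The sketch below is only an analogy: it assumes Python's built-in sqlite3 module as a stand-in for Spark SQL, and the tweet records and the names load_tweets and top_tweets are hypothetical, invented for illustration. It sorts in descending order so that the most retweeted tweets come first.

```python
import json
import sqlite3

# Hypothetical tweet records, one JSON object per line (not the book's dataset).
RAW_TWEETS = """\
{"text": "Spark SQL is neat", "retweetCount": 12}
{"text": "Hello world", "retweetCount": 3}
{"text": "Pandas love bamboo", "retweetCount": 40}
"""


def load_tweets(conn, raw):
    """Parse line-delimited JSON and expose it as a temporary table named tweets."""
    conn.execute("CREATE TEMP TABLE tweets (text TEXT, retweetCount INTEGER)")
    records = [json.loads(line) for line in raw.splitlines() if line.strip()]
    conn.executemany("INSERT INTO tweets VALUES (:text, :retweetCount)", records)


def top_tweets(conn, limit):
    """Return (text, retweetCount) pairs, most retweeted first."""
    cursor = conn.execute(
        "SELECT text, retweetCount FROM tweets "
        "ORDER BY retweetCount DESC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")  # sqlite3 stands in for the SQL engine here
load_tweets(conn, RAW_TWEETS)
result = top_tweets(conn, 2)
```

Dropping DESC from the query would return the least retweeted tweets instead, since SQL sorts in ascending order by default.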

9.2.3 SchemaRDDs
 Both loading data and executing queries return SchemaRDDs.
SchemaRDDs are similar to tables in a traditional database. Under
the hood, a SchemaRDD is an RDD composed of Row objects with
additional schema information of the types in each column.
 Row objects are just wrappers around arrays of basic types (e.g.,
integers and strings), and we’ll cover them in more detail in the next
section.
 One important note: in future versions of Spark, the name
SchemaRDD may be changed to DataFrame. This renaming was still
under discussion as this book went to print. SchemaRDDs are also
regular RDDs, so you can operate on them using existing RDD
transformations like map() and filter(). However, they provide
several additional capabilities. Most importantly, you can register
any SchemaRDD as a temporary table to query it via
HiveContext.sql or SQLContext.sql. You do so using the
SchemaRDD’s registerTempTable() method.

SchemaRDDs
 Row objects represent records inside SchemaRDDs. In
Scala/Java, Row objects have a number of getter
functions to obtain the value of each field given its index.
 Accessing the text column (also the first column) in the
topTweets SchemaRDD in Scala
 val topTweetText = topTweets.map(row => row.getString(0))
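To make the idea of index-based getters concrete, here is a toy model in plain Python. It is not Spark's Row class: the class below and its get_string and get_int methods are hypothetical names chosen to mirror the Scala getters, and the sample data is made up.

```python
class Row:
    """Toy stand-in for a Row: a thin wrapper around a list of basic values."""

    def __init__(self, *values):
        self._values = list(values)

    def get_string(self, index):
        """Return the field at position index, insisting that it is a string."""
        value = self._values[index]
        if not isinstance(value, str):
            raise TypeError("field %d is not a string" % index)
        return value

    def get_int(self, index):
        """Return the field at position index, insisting that it is an integer."""
        value = self._values[index]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError("field %d is not an integer" % index)
        return value


# Hypothetical query result with the columns (text, retweetCount).
top_tweets = [Row("Pandas love bamboo", 40), Row("Spark SQL is neat", 12)]

# Same shape as the Scala map above: pull column 0 out of every row.
top_tweet_text = [row.get_string(0) for row in top_tweets]
```

The typed getters fail loudly when the requested type does not match the stored value, which is the main reason to prefer them over reaching into the underlying array.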

9.2.4 Caching
 Caching in Spark SQL works a bit differently from caching
ordinary RDDs. Since we know the types of each column,
Spark is able to store the data more efficiently. To make
sure that we cache using the memory-efficient
representation, rather than the full objects, we should use
the special hiveCtx.cacheTable("tableName") method.
 When caching a table, Spark SQL represents the data in an
in-memory columnar format. This cached table will remain
in memory only for the life of our driver program, so if it
exits we will need to recache our data. As with RDDs, we
cache tables when we expect to run multiple tasks or
queries against the same data.
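The difference between caching full objects and caching columns can be illustrated in a few lines of plain Python. This is a conceptual sketch, not Spark's actual in-memory format: the schema, the sample rows, and the to_columns helper are assumptions made for the example.

```python
from array import array

# Hypothetical table: (name, age) records with a known schema.
SCHEMA = [("name", str), ("age", int)]
ROWS = [("holden", 30), ("sparky", 4), ("panda", 7)]


def to_columns(rows, schema):
    """Pivot row tuples into one container per column."""
    columns = {}
    for position, (field, field_type) in enumerate(schema):
        values = [row[position] for row in rows]
        if field_type is int:
            # The column type is known, so pack the values into a typed
            # 64-bit array instead of keeping one Python object per value.
            columns[field] = array("q", values)
        else:
            columns[field] = values
    return columns


columns = to_columns(ROWS, SCHEMA)

# A query that only needs "age" never touches the "name" column.
oldest = max(columns["age"])
```

Knowing the column types up front is what makes the compact representation possible, which is why the table-level cache can be more memory-efficient than caching generic objects.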

9.3. Loading and Saving Data
 Spark SQL supports a number of structured data sources out of the
box, letting you get Row objects from them without any complicated
loading process. These sources include Hive tables, JSON, and
Parquet files. In addition, if you query these sources using SQL and
select only a subset of the fields, Spark SQL can smartly scan only
the subset of the data for those fields, instead of scanning all the
data like a naive SparkContext.hadoopFile might.
 Apart from these data sources, you can also convert regular RDDs in
your program to SchemaRDDs by assigning them a schema. This
makes it easy to write SQL queries even when your underlying data
is Python or Java objects. Often, SQL queries are more concise
when you’re computing many quantities at once (e.g., if you wanted
to compute the average age, max age, and count of distinct user IDs
in one pass). In addition, you can easily join these RDDs with
SchemaRDDs from any other Spark SQL data source. In this section,
we’ll cover the external sources as well as this way of using RDDs.
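As an illustration of the conciseness claim, the three quantities mentioned above (average age, max age, and count of distinct user IDs) fit in a single SELECT. The sketch assumes Python's built-in sqlite3 module as a stand-in SQL engine and a made-up users table; it does not involve Spark.

```python
import sqlite3

# Hypothetical (user_id, age) records; user 2 appears twice on purpose.
USERS = [(1, 34), (2, 27), (2, 27), (3, 51)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", USERS)

# One statement computes all three aggregates over the table.
avg_age, max_age, distinct_users = conn.execute(
    "SELECT AVG(age), MAX(age), COUNT(DISTINCT user_id) FROM users"
).fetchone()
```

Writing the same computation by hand would need a running sum, a running count, a running maximum, and a set of seen IDs, all updated inside one loop.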

9.3.1 Apache Hive
 When loading data from Hive, Spark SQL supports
any Hive-supported storage format (SerDe),
including text files, RCFiles, ORC, Parquet, Avro,
and Protocol Buffers.
 Hive load in Scala
 import org.apache.spark.sql.hive.HiveContext
 val hiveCtx = new HiveContext(sc)
 val rows = hiveCtx.sql("SELECT key, value FROM mytable")
 val keys = rows.map(row => row.getInt(0))

9.3.2 Parquet
 Parquet is a popular column-oriented storage format that can store
records with nested fields efficiently. It is often used with tools in
the Hadoop ecosystem, and it supports all of the data types in Spark
SQL. Spark SQL provides methods for reading data directly from, and
writing data directly to, Parquet files.
 Parquet load in Python
 # Load some data from a Parquet file with the fields name and favouriteAnimal
 rows = hiveCtx.parquetFile(parquetFile)
 names = rows.map(lambda row: row.name)
 print("Everyone")
 print(names.collect())
 Parquet query in Python
 # Find the panda lovers
 rows.registerTempTable("people")
 pandaFriends = hiveCtx.sql("SELECT name FROM people WHERE
favouriteAnimal

9.3.3 JSON
 If you have a JSON file with records fitting the same schema,
Spark SQL can infer the schema by scanning the file and let
you access fields by name.
 If you have ever found yourself staring at a huge directory of
JSON records, Spark SQL’s schema inference is a very
effective way to start working with the data without writing
any special loading code.
 To load our JSON data, all we need to do is call the jsonFile()
function on our hiveCtx.
 Input records
 {"name": "Holden"}
 {"name":"Sparky The Bear", "lovesPandas":true,
"knows":{"friends": ["holden"]}}
 Loading JSON with Spark SQL in Scala
 val input = hiveCtx.jsonFile(inputFile)

9.3.3 JSON
 Resulting schema from printSchema()
 root
  |-- knows: struct (nullable = true)
  |    |-- friends: array (nullable = true)
  |    |    |-- element: string (containsNull = false)
  |-- lovesPandas: boolean (nullable = true)
  |-- name: string (nullable = true)
 SQL query nested and array elements
 select hashtagEntities[0].text from tweets LIMIT 1;
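The idea behind schema inference can be sketched in plain Python: scan every record, describe the type of each field, and merge the descriptions so that fields missing from some records still appear in the final schema. This is a simplified illustration, not Spark SQL's algorithm. The function names are hypothetical, the type labels are chosen for the example, arrays are typed by their first element, and type conflicts are resolved by keeping the first type seen.

```python
import json

# The two input records from the slide above, one JSON object per line.
RECORDS = [
    '{"name": "Holden"}',
    '{"name": "Sparky The Bear", "lovesPandas": true,'
    ' "knows": {"friends": ["holden"]}}',
]


def infer(value):
    """Describe the type of one parsed JSON value."""
    if isinstance(value, bool):  # test bool first: True is also an int in Python
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        # Simplification: type the array by its first element only.
        return ["array", infer(value[0]) if value else "string"]
    if isinstance(value, dict):
        return {key: infer(item) for key, item in value.items()}
    return "null"


def merge(left, right):
    """Union two schema descriptions, recursing into nested structs."""
    if isinstance(left, dict) and isinstance(right, dict):
        merged = dict(left)
        for key, value in right.items():
            merged[key] = merge(left[key], value) if key in left else value
        return merged
    return left  # simplification: on a conflict, keep the first type seen


def infer_schema(lines):
    """Scan all records and return the merged schema."""
    schema = {}
    for line in lines:
        schema = merge(schema, infer(json.loads(line)))
    return schema


def render(schema, depth=0):
    """Format the schema as an indented tree, fields sorted by name."""
    lines = []
    for name in sorted(schema):
        field = schema[name]
        prefix = " |   " * depth + " |-- "
        if isinstance(field, dict):
            lines.append(prefix + name + ": struct")
            lines.extend(render(field, depth + 1))
        elif isinstance(field, list):
            element = field[1] if isinstance(field[1], str) else "struct"
            lines.append(prefix + name + ": array of " + element)
        else:
            lines.append(prefix + name + ": " + field)
    return lines


schema = infer_schema(RECORDS)
tree = render(schema)
```

Note that the first record has only a name field, yet the merged schema still contains lovesPandas and knows, which is why those fields have to be treated as nullable.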

9.3.4 From RDDs
 In addition to loading data, we can also create a SchemaRDD
from an RDD. In Scala, RDDs with case classes are implicitly
converted into SchemaRDDs.
 For Python, we create an RDD of Row objects and then call
inferSchema().
 Creating a SchemaRDD from case class in Scala
 case class HappyPerson(handle: String, favouriteBeverage: String)
 ...
 // Create a person and turn it into a Schema RDD
 val happyPeopleRDD = sc.parallelize(List(HappyPerson("holden",
"coffee")))
 // Note: there is an implicit conversion
 // that is equivalent to sqlCtx.createSchemaRDD(happyPeopleRDD)
 happyPeopleRDD.registerTempTable("happy_people")
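The reason a case class is enough to build a SchemaRDD is that its definition already carries the column names and types. A rough Python analogy, assuming a dataclass in place of the Scala case class and two hypothetical helpers (schema_of and to_rows), looks like this. It only illustrates the idea and does not use Spark.

```python
from dataclasses import astuple, dataclass, fields


@dataclass
class HappyPerson:
    handle: str
    favouriteBeverage: str


def schema_of(record_type):
    """List (column name, type name) pairs taken from the class definition."""
    return [(field.name, field.type.__name__) for field in fields(record_type)]


def to_rows(records):
    """Flatten typed objects into plain tuples, one tuple per row."""
    return [astuple(record) for record in records]


happy_people = [HappyPerson("holden", "coffee")]
schema = schema_of(HappyPerson)
rows = to_rows(happy_people)
```

With the schema and the rows separated like this, the data can be registered as a table and queried by column name.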

9.4. JDBC/ODBC Server
 Spark SQL also provides JDBC connectivity.
 The JDBC server runs as a standalone Spark driver
program that can be shared by multiple clients. Any
client can cache tables in memory, query them, and so
on, and the cluster resources and cached data will be
shared among all of them.
 Launching the JDBC server
 ./sbin/start-thriftserver.sh --master sparkMaster
 Connecting to the JDBC server with Beeline
 holden@hmbp2:~/repos/spark$ ./bin/beeline -u jdbc:hive2://localhost:10000
 By default the server listens on localhost:10000, but we can
change this with either environment variables or Hive
configuration properties.

9.4.1 Working with Beeline
 Within the Beeline client, you can use standard
HiveQL commands to create, list, and query tables.
You can find the full details of HiveQL in the Hive
Language Manual.
 Load table
 CREATE TABLE IF NOT EXISTS mytable (key INT, value STRING)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
 LOAD DATA LOCAL INPATH 'learning-spark-examples/files/int_string.csv'
 INTO TABLE mytable;
 Show tables
 SHOW TABLES;
 mytable
 Time taken: 0.052 seconds

9.4.1 Working with Beeline
 Beeline makes it easy to view query plans. You can
run EXPLAIN on a given query to see what the
execution plan will be.
 Spark SQL shell EXPLAIN
 spark-sql> EXPLAIN SELECT * FROM mytable where key = 1;
 == Physical Plan ==
 Filter (key#16 = 1)
  HiveTableScan [key#16,value#17], (MetastoreRelation default, mytable, None), None
 Time taken: 0.551 seconds

9.4.2 Long-Lived Tables and Queries
 One of the advantages of using Spark SQL’s JDBC
server is that we can share cached tables between
multiple programs. This is possible since the JDBC
Thrift server is a single driver program. To do this,
you only need to register the table and then run the
CACHE command on it, as shown in the previous
section.

9.5. User-Defined Functions
 User-defined functions, or UDFs, allow you to
register custom functions in Python, Java, and Scala
to call within SQL. They are a very popular way to
expose advanced functionality to SQL users in an
organization, so that these users can call into it
without writing code. Spark SQL makes it especially
easy to write UDFs. It supports both its own UDF
interface and existing Apache Hive UDFs.

9.5.1 Spark SQL UDFs
 Spark SQL offers a built-in method to easily register
UDFs by passing in a function in your programming
language. In Scala and Python, we can use the native
function and lambda syntax of the language, and in
Java we need only extend the appropriate UDF class.
Our UDFs can work on a variety of types, and we can
return a different type than the one we are called
with.
 Scala string length UDF
 // Register the UDF on the context, then apply it to the text column
 hiveCtx.registerFunction("strLenScala", (_: String).length)
 val tweetLength = hiveCtx.sql("SELECT strLenScala(text) FROM tweets LIMIT 10")
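The same register-a-function-then-call-it-from-SQL workflow exists in other SQL engines, which makes it easy to try locally. The sketch below assumes Python's built-in sqlite3 module as a stand-in for Spark SQL and uses its create_function method. The function name strLenPython and the sample rows are made up for the example.

```python
import sqlite3


def str_len(text):
    """The UDF body: an ordinary Python function taking one string."""
    return len(text)


conn = sqlite3.connect(":memory:")
# Register the function under a SQL-visible name; 1 is its argument count.
conn.create_function("strLenPython", 1, str_len)

conn.execute("CREATE TABLE tweets (text TEXT)")
conn.executemany("INSERT INTO tweets VALUES (?)", [("hello",), ("Spark SQL",)])

# Pass the column itself (text). Quoting it as 'text' would pass a string literal.
lengths = [
    row[0]
    for row in conn.execute("SELECT strLenPython(text) FROM tweets ORDER BY rowid")
]
```

The function returns an integer although it is called with a string, matching the point above that a UDF may return a different type than the one it is called with.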

9.5.2 Hive UDFs
 Spark SQL can also use existing Hive UDFs. The
standard Hive UDFs are already automatically included.
If you have a custom UDF, it is important to make sure
that the JARs for your UDF are included with your
application. If we run the JDBC server, note that we can
add this with the --jars command-line flag. Developing
Hive UDFs is beyond the scope of this book, so we will
instead introduce how to use existing Hive UDFs.
 Using a Hive UDF requires that we use the HiveContext
instead of a regular SQLContext. To make a Hive UDF
available, simply call
hiveCtx.sql("CREATE TEMPORARY FUNCTION name AS class.function").

9.6. Spark SQL Performance
 As alluded to in the introduction, Spark SQL’s
higher-level query language and additional type
information allow Spark SQL to be more efficient.
 Spark SQL is for more than just users who are
familiar with SQL. Spark SQL makes it very easy to
perform conditional aggregate operations, like
computing the sums of multiple columns, without
having to construct special objects.
 Spark SQL multiple sums
 SELECT SUM(user.favouritesCount), SUM(retweetCount),
user.id FROM tweets GROUP BY user.id

9.6.1 Performance Tuning Options
 Using the JDBC connector and the Beeline shell, we can set the
performance options described here, and other options, with the set command.
 Beeline command for enabling codegen
 beeline> set spark.sql.codegen=true;
 SET spark.sql.codegen=true
 spark.sql.codegen=true
 Time taken: 1.196 seconds
 Scala code for enabling codegen
 conf.set("spark.sql.codegen", "true")
 The first option is spark.sql.codegen, which causes Spark SQL to
compile each query to Java bytecode before running it. Codegen can make
long queries or frequently repeated queries substantially faster.
 The second option you may need to tune is
spark.sql.inMemoryColumnarStorage.batchSize. When caching
SchemaRDDs, Spark SQL groups together the records in the RDD in
batches of the size given by this option (default: 1000), and compresses
each batch.
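The batching step can be pictured with a short plain-Python sketch. It is not Spark's storage code: the helper names are hypothetical, and JSON plus zlib are stand-ins chosen only because they ship with Python. Only the batch size of 1000 comes from the text above.

```python
import json
import zlib

BATCH_SIZE = 1000  # mirrors the default quoted above


def cache_in_batches(rows, field_names, batch_size):
    """Split rows into batches and store each column of a batch compressed."""
    batches = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        batch = {}
        for position, field in enumerate(field_names):
            column = [row[position] for row in chunk]
            batch[field] = zlib.compress(json.dumps(column).encode("utf-8"))
        batches.append(batch)
    return batches


def read_column(batches, field):
    """Decompress one column across all batches, leaving other columns alone."""
    values = []
    for batch in batches:
        values.extend(json.loads(zlib.decompress(batch[field]).decode("utf-8")))
    return values


# Hypothetical data: 2,500 records of (name, retweetCount).
rows = [("user%d" % i, i % 5) for i in range(2500)]
batches = cache_in_batches(rows, ["name", "retweetCount"], BATCH_SIZE)
retweet_counts = read_column(batches, "retweetCount")
```

In this sketch, a larger batch size gives the compressor more repetition to exploit, while each batch also has to be built up in memory as a whole, so the setting is a trade-off rather than a value to maximize.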
