Introduction to PySpark

A One Hour Introduction to Analytics with PySpark
Introduction to PySpark
http://www.slideshare.net/rjurney/introduction-to-pyspark
or
http://bit.ly/intro_to_pyspark

Agile Data Science 2.0
Russell Jurney
2
Data Engineer
Data Scientist
Visualization Software Engineer
85%
85%
85%
Writer
85%
Teacher
50%
Russell Jurney is a veteran data
scientist and thought leader. He
coined the term Agile Data Science in
the book of that name from O’Reilly
in 2012, which outlines the first agile
development methodology for data
science. Russell has constructed
numerous full-stack analytics
products over the past ten years and
now works with clients helping them
extract value from their data assets.
Russell Jurney
Skill
Principal Consultant at Data Syndrome
Russell Jurney
Data Syndrome, LLC
Email : russell.jurney@gmail.com
Web : datasyndrome.com
Principal Consultant

Building Full-Stack Data Analytics Applications with Spark
http://bit.ly/agile_data_science
Agile Data Science 2.0

Agile Data Science 2.0 4
Realtime Predictive
Analytics
Rapidly learn to build entire predictive systems driven by
Kafka, PySpark, Speak Streaming, Spark MLlib and with a web
front-end using Python/Flask and JQuery.
Available for purchase at http://datasyndrome.com/video

Lorem Ipsum dolor siamet suame this placeholder for text can simply
random text. It has roots in a piece of classical. variazioni deiwords which
whichhtly. ven on your zuniga merida della is not denis.
Product Consulting
We build analytics products and systems
consisting of big data viz, predictions,
recommendations, reports and search.
Corporate Training
We offer training courses for data
scientists and engineers and data
science teams,
Video Training
We offer video training courses that rapidly
acclimate you with a technology and
technique.

What is Spark? What makes it go?
Concepts

Data Syndrome: Agile Data Science 2.0
Hadoop / HDFS
HDFS splits large data among many machines
7

Hadoop / MapReduce
In the beginning there was MapReduce
8

Spark / RDD
Spark RDDs are iterable MapReduce relations
9

Spark / DataFrame
Fast SQLish RDD thingies
10

Spark Streaming
Spark on realtime streams in mini-batches
11

Spark Ecosystem
Lots of cool stuff working together…
12
/

Setting up our class environment
Setup

Data Syndrome: Agile Data Science 2.0 14
Python 3 > 2.7
While the break in compatibility between Python 2.X
and 3.X was unfortunate and unnecessary , Python 3
has increasingly become the platform of choice for
analytics work. With a few alterations, all code in this
course will execute in a Python 2.7 environment.

Virtualbox
Virtualbox is a Free and Open Source Software (FOSS)
virtualization product for AMD64/Intel64 processors. It
supports many operating systems, and is under active
development.
https://www.virtualbox.org/wiki/Downloads

Vagrant
Vagrant sits on top of Virtualbox and provides easy to
use, reproducible development environments.
https://www.vagrantup.com/downloads.html

Vagrant Setup
17
Initializing our Vagrant Environment
# Get the project
git clone https://github.com/rjurney/Agile_Data_Code_2/
# Setup and connect to our virtual machine
vagrant up; vagrant ssh
# Now, from within Vagrant
cd Agile_Data_Code_2
intro_download.sh
# See Appendix A and install.sh for manual install

Amazon EC2
Alternatively, Amazon Web Services provide a simple
way to launch a prepared image for use in this exercise.

EC2 Setup for Ubuntu Linux
19
Initializing our EC2 Environment
# See ec2.sh, which uses aws/ec2_bootstrap.sh
# To use add: —user-data file://aws/ec2_bootstrap.sh
# Get the project
git clone git@github.com:rjurney/Agile_Data_Code_2.git
# Setup AWS CLI tools
pip install awscli
# Edit and run r3.xlarge instance with your key
./ec2.sh
# ssh to the machine

20
# Contents of ec2.sh
# Launch our instance, which ec2_bootstrap.sh will initialize 
aws ec2 run-instances  
--image-id ami-4ae1fb5d  
--key-name agile_data_science  
--user-data file://aws/ec2_bootstrap.sh  
--instance-type r3.xlarge  
--ebs-optimized  
--placement "AvailabilityZone=us-east-1d"  
--block-device-mappings '{"DeviceName":"/dev/sda1","Ebs":
{"DeleteOnTermination":false,"VolumeSize":1024}}'  
--count 1

21
# Download the data
./intro_download.sh

22
# Download the data
./intro_download.sh

Documentation Setup
Opening the right web pages to answer your questions
23
http://spark.apache.org/docs/latest/api/python/pyspark.html
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html

Learning the basics of PySpark
Basic PySpark

Hello, World!
How to load data and perform an operation on it in Spark
25
# See ch02/spark.py
# Load the text file using the SparkContext 
csv_lines = sc.textFile("data/example.csv") 
 
# Map the data to split the lines into a list 
data = csv_lines.map(lambda line: line.split(",")) 
 
# Collect the dataset into local RAM 
data.collect()

Creating Objects from CSV using a function
How to create objects from CSV using a function instead of a lambda
26
# See ch02/groupby.py
 
# Turn the CSV lines into objects 
def csv_to_record(line): 
parts = line.split(",") 
record = { 
"name": parts[0], 
"company": parts[1], 
"title": parts[2] 
} 
return record 
 
# Apply the function to every record 
records = csv_lines.map(csv_to_record) 
 
# Inspect the first item in the dataset 
records.first()

Using a GroupBy to Count Jobs
Count things using the groupBy API
27
# Group the records by the name of the person 
grouped_records = records.groupBy(lambda x: x["name"]) 
 
# Show the first group 
grouped_records.first() 
 
# Count the groups 
job_counts = grouped_records.map( 
lambda x: { 
"name": x[0], 
"job_count": len(x[1]) 
} 
) 
 
job_counts.first() 
 
job_counts.collect()

Map vs FlatMap
Understanding the difference between these two operators
28
# See ch02/flatmap.py
 
# Compute a relation of words by line 
words_by_line = csv_lines 
.map(lambda line: line.split(",")) 
 
words_by_line.collect() 
 
# Compute a relation of words 
flattened_words = csv_lines 
.map(lambda line: line.split(",")) 
.flatMap(lambda x: x) 
 
flattened_words.collect()

Map vs FlatMap
Understanding the difference between these two operators
29
words_by_line.collect()
[['Russell Jurney', 'Relato', 'CEO'],
['Florian Liebert', 'Mesosphere', 'CEO'],
['Don Brown', 'Rocana', 'CIO'],
['Steve Jobs', 'Apple', 'CEO'],
['Donald Trump', 'The Trump Organization', 'CEO'],
['Russell Jurney', 'Data Syndrome', 'Principal Consultant']]
flattened_words.collect()
['Russell Jurney',
'Relato',
'CEO',
'Florian Liebert',
'Mesosphere',
'CEO',
'Don Brown',
'Rocana',
'CIO',
'Steve Jobs',
'Apple',
'CEO',
'Donald Trump',
'The Trump Organization',
'CEO',
'Russell Jurney',
'Data Syndrome',
'Principal Consultant']

Using DataFrames and Spark SQL to Count Jobs
Converting an RDD to a DataFrame to use Spark SQL
30
# See ch02/sql.py
 
from pyspark.sql import Row 
 
# Convert the CSV into a pyspark.sql.Row 
def csv_to_row(line): 
parts = line.split(",") 
row = Row( 
name=parts[0], 
company=parts[1], 
title=parts[2] 
) 
return row 
 
# Apply the function to get rows in an RDD 
rows = csv_lines.map(csv_to_row)

Using DataFrames and Spark SQL to Count Jobs
Converting an RDD to a DataFrame to use Spark SQL
31
# Convert to a pyspark.sql.DataFrame 
rows_df = rows.toDF() 
 
# Register the DataFrame for Spark SQL 
rows_df.registerTempTable("executives") 
 
# Generate a new DataFrame with SQL using the SparkSession 
job_counts = spark.sql("""
SELECT
name,
COUNT(*) AS total
FROM executives
GROUP BY name
""") 
job_counts.show() 
 
# Go back to an RDD 
job_counts.rdd.collect()

Working with a more complex dataset
Exploratory Data Analysis
with Airline Data

Loading a Parquet Columnar File
Using the Apache Parquet format to load columnar data
33
# See ch02/load_on_time_performance.py
# Load the parquet file containing flight delay records 
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet') 
 
# Register the data for Spark SQL 
on_time_dataframe.registerTempTable("on_time_performance") 
 
# Check out the columns 
on_time_dataframe.columns 
 
# Check out some data 
on_time_dataframe 
.select("FlightDate", "TailNum", "Origin", "Dest", "Carrier", "DepDelay", "ArrDelay") 
.show()

Sampling a DataFrame
Sampling a DataFrame to get a better view of its data
34
# Trim the fields and keep the result 
trimmed_on_time = on_time_dataframe 
.select( 
"FlightDate", 
"TailNum", 
"Origin", 
"Dest", 
"Carrier", 
"DepDelay", 
"ArrDelay" 
)
# Sample 0.01% of the data and show 
trimmed_on_time.sample(False, 0.0001).show()

Calculating a Histogram
Computing the distribution of a column in a dataset
35
# See ch02/histogram.py 
# Load the parquet file containing flight delay records 
 
# Register the data for Spark SQL 
 
# Compute a histogram of departure delays 
on_time_dataframe 
.select("DepDelay") 
.rdd 
.histogram(10)

Displaying a Histogram
Using pyplot to display a histogram
36
import numpy as np 
import matplotlib.mlab as mlab 
import matplotlib.pyplot as plt 
 
# Function to plot a histogram using pyplot 
def create_hist(rdd_histogram_data): 
"""Given an RDD.histogram, plot a pyplot histogram""" 
heights = np.array(rdd_histogram_data[1]) 
full_bins = rdd_histogram_data[0] 
mid_point_bins = full_bins[:-1] 
widths = [abs(i - j) for i, j in zip(full_bins[:-1], full_bins[1:])] 
bar = plt.bar(mid_point_bins, heights, width=widths, color='b') 
return bar 
 
# Compute a histogram of departure delays 
departure_delay_histogram = on_time_dataframe 
.select("DepDelay") 
.rdd 
.histogram(10, [-60,-30,-15,-10,-5,0,5,10,15,30,60,90,120,180]) 
 
create_hist(departure_delay_histogram)

Displaying a Histogram
Using pyplot to display a histogram
37

Counting Airplanes
How many airplanes are in the US fleet in total?
38
# See ch05/assess_airplanes.py
# Load the parquet file 
 
# Dump the unneeded fields 
tail_numbers = on_time_dataframe.rdd.map(lambda x: x.TailNum) 
tail_numbers = tail_numbers.filter(lambda x: x != '') 
 
# distinct() gets us unique tail numbers 
unique_tail_numbers = tail_numbers.distinct() 
 
# now we need a count() of unique tail numbers 
airplane_count = unique_tail_numbers.count() 
print("Total airplanes: {}".format(airplane_count))

Counting Total Flights by Month
Preparing data for a chart
39
# See ch05/total_flights.py
 
# Use SQL to look at the total flights by month across 2015 
on_time_dataframe.registerTempTable("on_time_dataframe") 
total_flights_by_month = spark.sql( 
"""SELECT Month, Year, COUNT(*) AS total_flights 
FROM on_time_dataframe 
GROUP BY Year, Month 
ORDER BY Year, Month""" 
) 
 
# This map/asDict trick makes the rows print a little prettier. It is optional. 
flights_chart_data = total_flights_by_month.rdd.map(lambda row: row.asDict()) 
flights_chart_data.collect()

Preparing Complex Records for Storage
Getting data ready for storage in a document or key/value store
40
# See ch05/extract_airplanes.py
 
# Filter down to the fields we need to identify and link to a flight 
flights = on_time_dataframe.rdd.map(lambda x:  
(x.Carrier, x.FlightDate, x.FlightNum, x.Origin, x.Dest, x.TailNum) 
) 
 
# Group flights by tail number, sorted by date, then flight number, then origin/dest 
flights_per_airplane = flights 
.map(lambda nameTuple: (nameTuple[5], [nameTuple[0:5]])) 
.reduceByKey(lambda a, b: a + b) 
.map(lambda tuple: 
{ 
'TailNum': tuple[0],  
'Flights': sorted(tuple[1], key=lambda x: (x[1], x[2], x[3], x[4])) 
} 
) 
flights_per_airplane.first()

Counting Flight Delays
Analyzing and understanding why flights are late
41
# See ch07/explore_delays.py
# Load the on-time parquet file 
 
total_flights = on_time_dataframe.count() 
 
# Flights that were late leaving... 
late_departures = on_time_dataframe.filter(on_time_dataframe.DepDelayMinutes > 0) 
total_late_departures = late_departures.count() 
 
# Flights that were late arriving... 
late_arrivals = on_time_dataframe.filter(on_time_dataframe.ArrDelayMinutes > 0) 
total_late_arrivals = late_arrivals.count()
# Get the percentage of flights that are late, rounded to 1 decimal place 
pct_late = round((total_late_arrivals / (total_flights * 1.0)) * 100, 1)

Hero Flights
How many flights made up for time in the air? Those that departed late and arrived on time?
42
# Flights that left late but made up time to arrive on time... 
on_time_heros = on_time_dataframe.filter( 
(on_time_dataframe.DepDelayMinutes > 0) 
& 
(on_time_dataframe.ArrDelayMinutes <= 0) 
) 
total_on_time_heros = on_time_heros.count()

Presenting Results
Displaying the answers in plaintext we’ve just calculated
43
print("Total flights: {:,}".format(total_flights)) 
print("Late departures: {:,}".format(total_late_departures)) 
print("Late arrivals: {:,}".format(total_late_arrivals)) 
print("Recoveries: {:,}".format(total_on_time_heros)) 
print("Percentage Late: {}%".format(pct_late))

44
# Get the average minutes late departing and arriving 
spark.sql(""" 
SELECT 
ROUND(AVG(DepDelay),1) AS AvgDepDelay, 
ROUND(AVG(ArrDelay),1) AS AvgArrDelay 
FROM on_time_performance 
""" 
).show()
Average Lateness Departing and Arriving
Drilling down into flights and how late they are…

Sampling Late Flights
Getting to know our data by sampling records of interest
45
# Why are flights late? Lets look at some delayed flights and the delay causes 
late_flights = spark.sql(""" 
SELECT 
ArrDelayMinutes, 
WeatherDelay, 
CarrierDelay, 
NASDelay, 
SecurityDelay, 
LateAircraftDelay 
FROM 
on_time_performance 
WHERE 
WeatherDelay IS NOT NULL 
OR 
CarrierDelay IS NOT NULL 
OR 
NASDelay IS NOT NULL 
OR 
SecurityDelay IS NOT NULL 
OR 
LateAircraftDelay IS NOT NULL 
ORDER BY 
FlightDate 
""") 
late_flights.sample(False, 0.01).show()

Why are Flights Late?
46
# Calculate the percentage contribution to delay for each source 
total_delays = spark.sql(""" 
SELECT 
ROUND(SUM(WeatherDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_weather_delay, 
ROUND(SUM(CarrierDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_carrier_delay, 
ROUND(SUM(NASDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_nas_delay, 
ROUND(SUM(SecurityDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_security_delay, 
ROUND(SUM(LateAircraftDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_late_aircraft_delay 
FROM on_time_performance 
""") 
total_delays.show()

How Often are Weather Delayed Flights Late?
47
# Eyeball the first to define our buckets 
weather_delay_histogram = on_time_dataframe 
.select("WeatherDelay") 
.rdd 
.histogram([1, 5, 10, 15, 30, 60, 120, 240, 480, 720, 24*60.0]) 
print(weather_delay_histogram)
create_hist(weather_delay_histogram)

How Often are Weather Delayed Flights Late?
48

Preparing Histogram Data for d3.js
49
# Transform the data into something easily consumed by d3 
def histogram_to_publishable(histogram): 
record = {'key': 1, 'data': []} 
for label, value in zip(histogram[0], histogram[1]): 
record['data'].append( 
{ 
'label': label, 
'value': value 
} 
) 
return record
# Recompute the weather histogram with a filter for on-time flights 
weather_delay_histogram = on_time_dataframe 
.filter( 
(on_time_dataframe.WeatherDelay != None) 
& 
(on_time_dataframe.WeatherDelay > 0) 
) 
.select("WeatherDelay") 
.rdd 
.histogram([0, 15, 30, 60, 120, 240, 480, 720, 24*60.0]) 
print(weather_delay_histogram) 
 
record = histogram_to_publishable(weather_delay_histogram)

Building a classifier model
Predictive Analytics
Machine Learning

Download Prepared Training Data
Saving time by using a prepared dataset
51
# Be in the root directory of the project
# Run the download script
ch08/download_data.sh

String Vectorization
From properties of items to vector format
52

scikit-learn was 166. Spark MLlib is very powerful!
ch08/train_spark_mllib_model.py
190 Line Model
# !/usr/bin/env python 
 
import sys, os, re 
 
# Pass date and base path to main() from airflow 
def main(base_path): 
 
# Default to "." 
try: base_path 
except NameError: base_path = "." 
if not base_path: 
base_path = "." 
 
APP_NAME = "train_spark_mllib_model.py" 
 
# If there is no SparkSession, create the environment 
try: 
sc and spark 
except NameError as e: 
import findspark 
findspark.init() 
import pyspark 
import pyspark.sql 
 
sc = pyspark.SparkContext() 
spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate() 
 
# 
# { 
# "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00", 
# "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0, 
# "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS" 
# } 
# 
from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType 
from pyspark.sql.types import StructType, StructField 
from pyspark.sql.functions import udf 
 
schema = StructType([ 
StructField("ArrDelay", DoubleType(), True), # "ArrDelay":5.0 
StructField("CRSArrTime", TimestampType(), True), # "CRSArrTime":"2015-12-31T03:20:00.000-08:00" 
StructField("CRSDepTime", TimestampType(), True), # "CRSDepTime":"2015-12-31T03:05:00.000-08:00" 
StructField("Carrier", StringType(), True), # "Carrier":"WN" 
StructField("DayOfMonth", IntegerType(), True), # "DayOfMonth":31 
StructField("DayOfWeek", IntegerType(), True), # "DayOfWeek":4 
StructField("DayOfYear", IntegerType(), True), # "DayOfYear":365 
StructField("DepDelay", DoubleType(), True), # "DepDelay":14.0 
StructField("Dest", StringType(), True), # "Dest":"SAN" 
StructField("Distance", DoubleType(), True), # "Distance":368.0 
StructField("FlightDate", DateType(), True), # "FlightDate":"2015-12-30T16:00:00.000-08:00" 
StructField("FlightNum", StringType(), True), # "FlightNum":"6109" 
StructField("Origin", StringType(), True), # "Origin":"TUS" 
]) 
 
input_path = "{}/data/simple_flight_delay_features.jsonl.bz2".format( 
base_path 
) 
features = spark.read.json(input_path, schema=schema) 
features.first() 
 
# 
# Check for nulls in features before using Spark ML 
# 
null_counts = [(column, features.where(features[column].isNull()).count()) for column in features.columns] 
cols_with_nulls = filter(lambda x: x[1] > 0, null_counts) 
print(list(cols_with_nulls)) 
 
# 
# Add a Route variable to replace FlightNum 
# 
from pyspark.sql.functions import lit, concat 
features_with_route = features.withColumn( 
'Route', 
concat( 
features.Origin, 
lit('-'), 
features.Dest 
) 
) 
features_with_route.show(6) 
 
# 
# Use pysmark.ml.feature.Bucketizer to bucketize ArrDelay into on-time, slightly late, very late (0, 1, 2) 
# 
from pyspark.ml.feature import Bucketizer 
 
# Setup the Bucketizer 
splits = [-float("inf"), -15.0, 0, 30.0, float("inf")] 
arrival_bucketizer = Bucketizer( 
splits=splits, 
inputCol="ArrDelay", 
outputCol="ArrDelayBucket" 
) 
 
# Save the bucketizer 
arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path) 
arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path) 
 
# Apply the bucketizer 
ml_bucketized_features = arrival_bucketizer.transform(features_with_route) 
ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show() 
 
# 
# Extract features tools in with pyspark.ml.feature 
# 
from pyspark.ml.feature import StringIndexer, VectorAssembler 
 
# Turn category fields into indexes 
for column in ["Carrier", "Origin", "Dest", "Route"]: 
string_indexer = StringIndexer( 
inputCol=column, 
outputCol=column + "_index" 
) 
 
string_indexer_model = string_indexer.fit(ml_bucketized_features) 
ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features) 
 
# Drop the original column 
ml_bucketized_features = ml_bucketized_features.drop(column) 
 
# Save the pipeline model 
string_indexer_output_path = "{}/models/string_indexer_model_{}.bin".format( 
base_path, 
column 
) 
string_indexer_model.write().overwrite().save(string_indexer_output_path) 
 
# Combine continuous, numeric fields with indexes of nominal ones 
# ...into one feature vector 
numeric_columns = [ 
"DepDelay", "Distance", 
"DayOfMonth", "DayOfWeek", 
"DayOfYear"] 
index_columns = ["Carrier_index", "Origin_index", 
"Dest_index", "Route_index"] 
vector_assembler = VectorAssembler( 
inputCols=numeric_columns + index_columns, 
outputCol="Features_vec" 
) 
final_vectorized_features = vector_assembler.transform(ml_bucketized_features) 
 
# Save the numeric vector assembler 
vector_assembler_path = "{}/models/numeric_vector_assembler.bin".format(base_path) 
vector_assembler.write().overwrite().save(vector_assembler_path) 
 
# Drop the index columns 
for column in index_columns: 
final_vectorized_features = final_vectorized_features.drop(column) 
 
# Inspect the finalized features 
final_vectorized_features.show() 
 
# Instantiate and fit random forest classifier on all the data 
from pyspark.ml.classification import RandomForestClassifier 
rfc = RandomForestClassifier( 
featuresCol="Features_vec", 
labelCol="ArrDelayBucket", 
predictionCol="Prediction", 
maxBins=4657, 
maxMemoryInMB=1024 
) 
model = rfc.fit(final_vectorized_features) 
 
# Save the new model over the old one 
model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.5.0.bin".format( 
base_path 
) 
model.write().overwrite().save(model_output_path) 
 
# Evaluate model using test data 
predictions = model.transform(final_vectorized_features) 
 
from pyspark.ml.evaluation import MulticlassClassificationEvaluator 
evaluator = MulticlassClassificationEvaluator( 
metricName="accuracy" 
) 
accuracy = evaluator.evaluate(predictions) 
print("Accuracy = {}".format(accuracy)) 
 
# Check the distribution of predictions 
predictions.groupBy("Prediction").count().show() 
 
# Check a sample 
predictions.sample(False, 0.001, 18).orderBy("CRSDepTime").show(6) 
 
if __name__ == "__main__": 
main(sys.argv[1])

Loading Our Training Data
Loading our data as a DataFrame to use the Spark ML APIs
54
 
]) 
 
features = spark.read.json( 
"data/simple_flight_delay_features.jsonl.bz2", 
schema=schema 
) 
features.first()

Checking the Data for Nulls
Nulls will cause problems hereafter, so detect and address them first
55
# 
# Check for nulls in features before using Spark ML 
# 
null_counts = [(column, features.where(features[column].isNull()).count()) for column in features.columns] 
cols_with_nulls = filter(lambda x: x[1] > 0, null_counts) 
print(list(cols_with_nulls))

Adding a Feature - The Route
Route is defined as origin airport code + “-“ + destination airport code
56
# 
# 
 
'Route', 
concat( 
features.Origin, 
lit('-'), 
features.Dest 
) 
) 
features_with_route.select("Origin", "Dest", "Route").show(5)

Bucketizing ArrDelay into ArrDelayBucket
We can’t classify a continuous variable, so we must bucketize it to make it nominal/categorical
57
# 
# Use pysmark.ml.feature.Bucketizer to bucketize ArrDelay 
# 
 
bucketizer = Bucketizer( 
splits=splits, 
) 
ml_bucketized_features = bucketizer.transform(features_with_route) 
 
# Check the buckets out 
ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()

Indexing String Columns into Numeric Columns
Nominal/categorical/string columns need to be made numeric before we can vectorize them
58
# 
# 
 
# Turn category fields into categoric feature vectors, then drop intermediate fields 
for column in ["Carrier", "DayOfMonth", "DayOfWeek", "DayOfYear", 
"Origin", "Dest", "Route"]: 
inputCol=column, 
) 
ml_bucketized_features = string_indexer.fit(ml_bucketized_features) 
.transform(ml_bucketized_features) 
 
# Check out the indexes 
ml_bucketized_features.show(6)

Combining Numeric and Indexed Fields into One Vector
Our classifier needs a single field, so we combine all our numeric fields into one feature vector
59
# Handle continuous, numeric fields by combining them into one feature vector 
numeric_columns = ["DepDelay", "Distance"] 
index_columns = ["Carrier_index", "DayOfMonth_index", 
"DayOfWeek_index", "DayOfYear_index", "Origin_index", 
"Origin_index", "Dest_index", "Route_index"] 
) 
 
 
# Check out the features 
final_vectorized_features.show()

Splitting our Data in a Test/Train Split
We need to split our data to evaluate the performance of our classifier
60
# 
# Cross validate, train and evaluate classifier 
# 
 
# Test/train split 
training_data, test_data = final_vectorized_features.randomSplit([0.7, 0.3])

Training Our Model
This is the magic in machine learning, and it is only a couple of lines of code
61
# Instantiate and fit random forest classifier 
maxBins=4657, 
maxMemoryInMB=1024 
) 
model = rfc.fit(training_data)

Evaluating Our Model
Using the test/train split to evaluate our model for accuracy
62
predictions = model.transform(test_data) 
 
evaluator = MulticlassClassificationEvaluator(labelCol="ArrDelayBucket", metricName="accuracy") 
accuracy = evaluator.evaluate(predictions) 
print("Accuracy = {}".format(accuracy))

Sampling Our Predictions
Making sure they pass the sniff check
63
# Check a sample 
predictions.sample(False, 0.001, 18).orderBy("CRSDepTime").show(6)

Experiment Setup
Necessary to improve model
64

155 additional lines to setup an experiment
and add 3 new features to improvement the model
ch09/improve_spark_mllib_model.py
345 L.O.C.
# !/usr/bin/env python 
 
import sys, os, re 
import json 
import datetime, iso8601 
from tabulate import tabulate 
 
# Pass date and base path to main() from airflow 
def main(base_path): 
APP_NAME = "train_spark_mllib_model.py" 
 
# If there is no SparkSession, create the environment 
try: 
sc and spark 
except NameError as e: 
import findspark 
findspark.init() 
import pyspark 
import pyspark.sql 
 
sc = pyspark.SparkContext() 
spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate() 
 
# 
# { 
# "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00", 
# "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0, 
# "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS" 
# } 
# 
 
]) 
 
input_path = "{}/data/simple_flight_delay_features.json".format( 
base_path 
) 
features = spark.read.json(input_path, schema=schema) 
features.first() 
 
# 
# 
'Route', 
concat( 
features.Origin, 
lit('-'), 
features.Dest 
) 
) 
features_with_route.show(6) 
 
# 
# Add the hour of day of scheduled arrival/departure 
# 
from pyspark.sql.functions import hour 
features_with_hour = features_with_route.withColumn( 
"CRSDepHourOfDay", 
hour(features.CRSDepTime) 
) 
features_with_hour = features_with_hour.withColumn( 
"CRSArrHourOfDay", 
hour(features.CRSArrTime) 
) 
features_with_hour.select("CRSDepTime", "CRSDepHourOfDay", "CRSArrTime", "CRSArrHourOfDay").show() 
 
# 
# Use pysmark.ml.feature.Bucketizer to bucketize ArrDelay into on-time, slightly late, very late (0, 1, 2) 
# 
 
# Setup the Bucketizer 
arrival_bucketizer = Bucketizer( 
splits=splits, 
) 
 
# Save the model 
arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path) 
arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path) 
 
# Apply the model 
ml_bucketized_features = arrival_bucketizer.transform(features_with_hour) 
ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show() 
 
# 
# 
 
# Turn category fields into indexes 
for column in ["Carrier", "Origin", "Dest", "Route"]: 
inputCol=column, 
) 
 
string_indexer_model = string_indexer.fit(ml_bucketized_features) 
ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features) 
# Save the pipeline model 
string_indexer_output_path = "{}/models/string_indexer_model_3.0.{}.bin".format( 
base_path, 
column 
) 
string_indexer_model.write().overwrite().save(string_indexer_output_path) 
 
# Combine continuous, numeric fields with indexes of nominal ones 
# ...into one feature vector 
numeric_columns = [ 
"DepDelay", "Distance", 
"DayOfMonth", "DayOfWeek", 
"DayOfYear", "CRSDepHourOfDay", 
"CRSArrHourOfDay"] 
index_columns = ["Carrier_index", "Origin_index", 
"Dest_index", "Route_index"] 
) 
 
# Save the numeric vector assembler 
vector_assembler_path = "{}/models/numeric_vector_assembler_3.0.bin".format(base_path) 
vector_assembler.write().overwrite().save(vector_assembler_path) 
 
 
# Inspect the finalized features 
final_vectorized_features.show() 
 
# 
# Cross validate, train and evaluate classifier: loop 5 times for 4 metrics 
# 
 
from collections import defaultdict 
scores = defaultdict(list) 
feature_importances = defaultdict(list) 
metric_names = ["accuracy", "weightedPrecision", "weightedRecall", "f1"] 
split_count = 3 
 
for i in range(1, split_count + 1): 
print("nRun {} out of {} of test/train splits in cross validation...".format( 
i, 
split_count, 
) 
) 
 
# Test/train split 
training_data, test_data = final_vectorized_features.randomSplit([0.8, 0.2]) 
 
# Instantiate and fit random forest classifier on all the data 
maxBins=4657, 
) 
model = rfc.fit(training_data) 
 
# Save the new model over the old one 
model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.baseline.bin".format( 
base_path 
) 
model.write().overwrite().save(model_output_path) 
 
predictions = model.transform(test_data) 
 
# Evaluate this split's results for each metric 
for metric_name in metric_names: 
evaluator = MulticlassClassificationEvaluator( 
metricName=metric_name 
) 
score = evaluator.evaluate(predictions) 
 
scores[metric_name].append(score) 
print("{} = {}".format(metric_name, score)) 
 
# 
# Collect feature importances 
# 
feature_names = vector_assembler.getInputCols() 
feature_importance_list = model.featureImportances 
for feature_name, feature_importance in zip(feature_names, feature_importance_list): 
feature_importances[feature_name].append(feature_importance) 
 
# 
# Evaluate average and STD of each metric and print a table 
# 
import numpy as np 
score_averages = defaultdict(float) 
 
# Compute the table data 
average_stds = [] # ha 
metric_scores = scores[metric_name] 
 
average_accuracy = sum(metric_scores) / len(metric_scores) 
score_averages[metric_name] = average_accuracy 
 
std_accuracy = np.std(metric_scores) 
 
average_stds.append((metric_name, average_accuracy, std_accuracy)) 
 
# Print the table 
print("nExperiment Log") 
print("--------------") 
print(tabulate(average_stds, headers=["Metric", "Average", "STD"])) 
 
# 
# Persist the score to a sccore log that exists between runs 
# 
import pickle
# Load the score log or initialize an empty one 
try: 
score_log_filename = "{}/models/score_log.pickle".format(base_path) 
score_log = pickle.load(open(score_log_filename, "rb")) 
if not isinstance(score_log, list): 
score_log = [] 
except IOError: 
score_log = [] 
 
# Compute the existing score log entry 
score_log_entry = {metric_name: score_averages[metric_name] for metric_name in metric_names} 
 
# Compute and display the change in score for each metric 
try: 
last_log = score_log[-1] 
except (IndexError, TypeError, AttributeError): 
last_log = score_log_entry 
 
experiment_report = [] 
run_delta = score_log_entry[metric_name] - last_log[metric_name] 
experiment_report.append((metric_name, run_delta)) 
 
print("nExperiment Report") 
print("-----------------") 
print(tabulate(experiment_report, headers=["Metric", "Score"])) 
 
# Append the existing average scores to the log 
score_log.append(score_log_entry) 
 
# Persist the log for next run 
pickle.dump(score_log, open(score_log_filename, "wb")) 
 
# 
# Analyze and report feature importance changes 
# 
 
# Compute averages for each feature 
feature_importance_entry = defaultdict(float) 
for feature_name, value_list in feature_importances.items(): 
average_importance = sum(value_list) / len(value_list) 
feature_importance_entry[feature_name] = average_importance 
 
# Sort the feature importances in descending order and print 
import operator 
sorted_feature_importances = sorted( 
feature_importance_entry.items(), 
key=operator.itemgetter(1), 
reverse=True 
) 
 
print("nFeature Importances") 
print("-------------------") 
print(tabulate(sorted_feature_importances, headers=['Name', 'Importance'])) 
 
# 
# Compare this run's feature importances with the previous run's 
# 
 
# Load the feature importance log or initialize an empty one 
try: 
feature_log_filename = "{}/models/feature_log.pickle".format(base_path) 
feature_log = pickle.load(open(feature_log_filename, "rb")) 
if not isinstance(feature_log, list): 
feature_log = [] 
except IOError: 
feature_log = [] 
 
# Compute and display the change in score for each feature 
try: 
last_feature_log = feature_log[-1] 
except (IndexError, TypeError, AttributeError): 
last_feature_log = defaultdict(float) 
for feature_name, importance in feature_importance_entry.items(): 
last_feature_log[feature_name] = importance 
 
# Compute the deltas 
feature_deltas = {} 
for feature_name in feature_importances.keys(): 
run_delta = feature_importance_entry[feature_name] - last_feature_log[feature_name] 
feature_deltas[feature_name] = run_delta 
 
# Sort feature deltas, biggest change first 
import operator 
sorted_feature_deltas = sorted( 
feature_deltas.items(), 
key=operator.itemgetter(1), 
reverse=True 
) 
 
# Display sorted feature deltas 
print("nFeature Importance Delta Report") 
print("-------------------------------") 
print(tabulate(sorted_feature_deltas, headers=["Feature", "Delta"])) 
 
# Append the existing average deltas to the log 
feature_log.append(feature_importance_entry) 
 
# Persist the log for next run 
pickle.dump(feature_log, open(feature_log_filename, "wb")) 
 
if __name__ == "__main__": 
main(sys.argv[1])

Data Syndrome Russell Jurney
Principal Consultant
Email : rjurney@datasyndrome.com
Web : datasyndrome.com
Data Syndrome, LLC

Introduction to PySpark

More Related Content

What's hot(20)

Similar to Introduction to PySpark(20)

More from Russell Jurney(9)

Recently uploaded(20)

Introduction to PySpark