Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark


Josh McNutt and Keria Bermúdez-Hernández
Showtime Networks Inc.

Premium cable network known for bold and wholly
unique programming
In 2018 SHOWTIME had four of the top six scripted
hour-long series on premium cable: HOMELAND,
SHAMELESS, RAY DONOVAN and BILLIONS
Launch of SHOWTIME standalone streaming
service in July 2015 made available an
unprecedented level of user-level data
Showtime Networks Inc.

- Viewing History
- Customer Journey Information
- CRM interactions
...
Billions of records
Extremely high level of granularity
Big Data at SHOWTIME
Direct relationship with our
customers
SHOWTIME standalone streaming service
Rich data captured on
every customer interaction:

What is the lifetime value of a subscriber who
signs up to watch Ray Donovan?
How many free trial signups have been
generated by Shameless?
What is the probability a subscriber will begin
watching Billions for the first time in the next 7
days?

Questions
Capacity to Answer

Strong data science skills needed to
interact directly with data lake
Research/Business
Analyst
Basic questions answered via
operational reporting
Complex questions required bespoke
analyses, sometimes taking weeks or
longer to develop
Raw
Data

Questions
Capacity to Answer
How to reduce this
gap?

Small team of data scientists working to…
Democratize data and analytics
Understand and predict subscriber behaviors
Support data-driven programming & scheduling
Data Strategy Team

We’re working to bring the data closer to the users and the users closer to the data
Business/Research
Analyst
Raw
Data
Augmenting Data
Supply Chain
Making the data
more suitable for
analysis
Democratizing Data and Analytics

Master Viewers Table
Foundational customer data
representation
Captures 1000s of metrics and
behaviors in Free Trial, Paid Month 1,
lifetime
Tracks relationship between each
user and series
Supports several use cases:
• machine learning
• reporting/dashboarding
• ad hoc analysis
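
As a rough illustration of the kind of table described above, the sketch below rolls raw viewing events up to one row per user and series, with minutes watched split by subscription period. The event fields, the period labels, and the `build_viewer_rows` helper are all hypothetical; they are not taken from the actual SHOWTIME schema.

```python
from collections import defaultdict

# Hypothetical period labels; the slide names Free Trial, Paid Month 1 and lifetime.
PERIODS = ("free_trial", "paid_month_1")


def build_viewer_rows(events):
    """Aggregate (user_id, series, period, minutes) events into one row
    per (user_id, series) pair."""
    rows = defaultdict(
        lambda: {"free_trial": 0, "paid_month_1": 0, "lifetime": 0}
    )
    for user_id, series, period, minutes in events:
        row = rows[(user_id, series)]
        if period in PERIODS:
            row[period] += minutes
        row["lifetime"] += minutes  # lifetime counts every event
    return dict(rows)


# Made-up sample events for illustration only.
events = [
    ("u1", "Billions", "free_trial", 55),
    ("u1", "Billions", "paid_month_1", 110),
    ("u1", "Homeland", "paid_month_3", 50),
    ("u2", "Billions", "free_trial", 30),
]
table = build_viewer_rows(events)
```

Keying each row by user and series is one simple way to represent the user-to-series relationship so that the same rows can feed models, dashboards, and ad hoc queries.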

We’re working to bring the data closer to the users and the users closer to the data
Business/Research
Analyst
Raw
Data
Augmenting Data
Supply Chain
Data Skills
Training
Creating conditions
leading to a chain reaction
of curiosity, exploration,
analysis and insight
Making the data
more suitable for
analysis
Tableau
SQL

We build actionable machine learning models to gain an intimate understanding of
how our users behave and how to strengthen their relationship to SHOWTIME
XGBoost
Random Forests
Churn Propensity
Resubscription Propensity
Future Customer Value
Series Viewership Propensity
Understanding and Predicting Subscriber Behaviors
Put into production:

We are employing data and analytics to attribute revenue and subscribers to content
Free trial signups
Resubscribers
Reduction in Churn
among viewers
Renewal
Decisions
Schedule
Optimization
Subscriber
Growth
Revenue Growth
Charts: churn % by paid month (months 1 to 11); mean subscriber lifetime value (LTV)
Supporting Data-Driven Programming and Scheduling

How to stitch
everything
together?
FLASHBACK: In order to build this capability, we needed to
get basic infrastructure in place

• Team comprised exclusively of data scientists without a dedicated DevOps engineer
• A lot of time spent troubleshooting cluster configuration
• Launching clusters was slow and clumsy
• Debugging our bootstrap script (to install Python libraries) was nontrivial
• Any software or hardware upgrades were scary and generally avoided for as long as
possible
#struggle
Configuration Challenges

Back to focusing on data science!
• Booting and managing clusters is now easy, fast and
reliable
• Simple to attach/detach notebooks to clusters
• Databricks handles installation of Python libraries
• Easy to connect data sources to Tableau
• We can confidently experiment with new software or
hardware in an effort to improve our workflows
Databricks Unified Analytics Platform

Data Pipeline

Optimizing Pipeline
Solutions
• Apache Airflow
• Code optimization tactics:
– Replacing RDD and pandas transformations with PySpark DataFrame transformations
– Multithreading using the concurrent.futures module
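
The multithreading tactic can be sketched with the standard library's `concurrent.futures`. This is a minimal sketch, not the team's actual code: `build_table` is a hypothetical stand-in for a function that would trigger Spark work for one table, and the table names are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def build_table(name):
    """Hypothetical stand-in for a job that builds one table."""
    return f"{name}: done"


def build_tables_concurrently(names, max_workers=4):
    """Run independent table builds in parallel threads.

    pool.map returns results in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(build_table, names))


results = build_tables_concurrently(["viewers", "series", "churn"])
```

Threads fit this pattern because each driver thread mostly waits while the cluster does the work, and Spark's scheduler accepts jobs submitted from separate threads.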

Apache Airflow
• Different tables were produced by using the dbutils.notebook API
– While notebooks can technically be run concurrently, complex dependencies and
concurrent jobs are harder to manage in a single notebook and job
Figure: the same six jobs (job1 to job6) shown as a workflow in a single notebook and as a workflow in Apache Airflow
# Workflow in a single notebook
dbutils.notebook.run("notebook-name", 60,
    {"argument": "data", "argument2": "data2", ...})
# Workflow in Apache Airflow
notebook_task = DatabricksSubmitRunOperator(
    task_id='notebook_task',
    dag=dag,
    json=notebook_task_params)
• Apache Airflow was the solution for
scheduling and managing pipelines
– Airflow and Databricks integration
– Manages dependencies
– Each task can be run on a different cluster
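
The slides pass `notebook_task_params` to the operator without showing its contents. The sketch below builds one plausible payload using field names from the Databricks runs-submit request (`new_cluster`, `notebook_task`, `notebook_path`, `base_parameters`). The helper name, notebook path, Spark version, and node type are placeholders assumed for illustration, not values from the talk.

```python
def make_notebook_task_params(notebook_path, parameters, num_workers=2):
    """Build a run-submit payload for one notebook task on its own new cluster."""
    return {
        "new_cluster": {
            "spark_version": "5.3.x-scala2.11",  # placeholder value
            "node_type_id": "i3.xlarge",  # placeholder value
            "num_workers": num_workers,
        },
        "notebook_task": {
            "notebook_path": notebook_path,
            "base_parameters": dict(parameters),
        },
    }


# Hypothetical notebook path and arguments.
notebook_task_params = make_notebook_task_params(
    "/Shared/hypothetical/build_viewers",
    {"argument": "data", "argument2": "data2"},
)
```

Because each task carries its own cluster specification, different tasks in the same DAG can be sized differently.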

Code Optimization Tactics
• User level RDD with subscriber status and viewership records
• Combinations of methods from custom classes

Code Optimization Tactics
• To reduce task complexity, we saved intermediate tables

Different Levels of Concurrency
BEFORE AFTER

Tracking Data Quality
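
The talk's summary credits MLflow with tracking data quality, but the slides do not show which metrics were tracked. As a sketch under that caveat, the function below computes two common checks for a batch of rows: the row count and the fraction of missing values per field. The `quality_metrics` helper, the field names, and the sample rows are hypothetical.

```python
def quality_metrics(rows, required_fields):
    """Return the row count plus the fraction of missing values per field."""
    total = len(rows)
    metrics = {"row_count": float(total)}
    for field in required_fields:
        missing = sum(1 for row in rows if row.get(field) is None)
        metrics[f"null_fraction_{field}"] = missing / total if total else 0.0
    return metrics


# Made-up sample batch for illustration only.
batch = [
    {"title": "Billions", "category": "Drama"},
    {"title": "Shameless", "category": None},
]
metrics = quality_metrics(batch, ["title", "category"])
# With MLflow installed, each entry could be recorded inside a run
# via mlflow.log_metric(name, value).
```

Logging the same metrics on every pipeline run makes it possible to compare batches over time and spot sudden changes.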

Using Delta to Update Tables
• Without Delta:
– Ingesting daily data
– Updating fields
– Rewriting data
Chart: time to update tables, in hours: 3.03 without Delta, 0.47 with Delta
MERGE INTO viewership AS hv
USING temp_tc AS t
ON hv.title = t.title
WHEN MATCHED AND (hv.category IS NULL OR hv.category != t.category) THEN
  UPDATE SET hv.category = t.category
• With Delta:
– Ingesting daily data
– Updating fields where necessary
– Rewriting data
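
To make the effect of the MERGE condition concrete, the sketch below mimics its matching logic on in-memory Python dictionaries. It is only an illustration of which rows get updated, not a substitute for Delta. The sample rows and the `merge_categories` helper are hypothetical, and the sketch assumes each title appears at most once in the source.

```python
def merge_categories(viewership, updates):
    """Update category in place where titles match and the target category
    is missing or differs; return the number of rows updated."""
    by_title = {u["title"]: u["category"] for u in updates}
    updated = 0
    for row in viewership:
        if row["title"] not in by_title:
            continue  # no match: the row is left untouched
        new = by_title[row["title"]]
        # Mirrors SQL semantics: comparing against a NULL source value
        # is not true, so such rows are skipped unless the target is NULL.
        if row["category"] is None or (new is not None and row["category"] != new):
            row["category"] = new
            updated += 1
    return updated


# Made-up sample data for illustration only.
viewership = [
    {"title": "Billions", "category": None},
    {"title": "Homeland", "category": "Drama"},
    {"title": "Shameless", "category": "Comedy"},
    {"title": "Ray Donovan", "category": "Drama"},
]
updates = [
    {"title": "Billions", "category": "Drama"},
    {"title": "Homeland", "category": "Drama"},
    {"title": "Shameless", "category": "Comedy-Drama"},
]
updated = merge_categories(viewership, updates)
```

Only rows that actually need a new value are touched, which is the "updating fields where necessary" behavior listed above.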

Optimization Summary
• Apache Airflow
• Code optimizations
• MLflow for tracking data quality
• Delta for updating tables

Unified platform allows us to democratize data and
analytics
We continue to engage in ongoing innovation and
optimization
Exciting cultural shift
Final Thoughts

Questions?
