Be A Hero: Transforming
GoPro Analytics Data Pipeline
Machine Learning Innovation Summit, 2017
Chester Chen
ABOUT SPEAKER
• Head of Data Science &
Engineering
• Prev. Director of Engineering,
Alpine Data Labs
• Started working with Spark at
version 0.6
• Spoke at Hadoop Summit, Big
Data Scala, IEEE Big Data
Conferences
• Organizer of SF Big Analytics
meetup (6800+ members)
AGENDA
• Business Use Cases
• Data Platform Architecture
• Old Data Platform: Pros & Cons and Challenges
• New Data Platform Architecture and Initiatives
• Adding Data Schema During Ingestion (Dynamic DDL)
Business Use Cases
GROWING DATA NEEDS FROM GOPRO ECOSYSTEM
Data sources (Consumer Devices, GoPro Apps, E-Commerce, Social Media/OTT, 3rd-party data)
feed the Data Analytics Platform, which drives Product Insight, CRM/Marketing/Personalization,
and User Segmentation.
DATA PLATFORM CHALLENGES
• Monitoring
• Data Quality
• Enabling Predictive Analytics
• Cost
• Scalability
Data Platform Architecture
Transformation
OLD DATA PLATFORM ARCHITECTURE
• Induction Framework: batch ingestion, pre-processing, scheduled downloads (Rest API, FTP, S3 sync)
• Real Time Cluster (streaming ingestion): log file streaming, Kafka, Spark, HBase
• ETL Cluster: file dumps (JSON, CSV), Spark jobs, Hive
• Secure Data Mart: end-user query, Impala / Sentry, Parquet
• Analytics Apps: HUE, Tableau
STREAMING PIPELINE
Pipeline for processing streaming logs: HTTP traffic enters through an ELB into the Streaming
Cluster, and the results flow on to the batch ETL cluster.
SPARK STREAMING PIPELINE
The Spark streaming job fans events out to per-type output paths (/path1/…, /path2/…,
/path3/…, /path4/…) that are handed off to the batch ETL cluster. A rough sketch of such a
router follows.
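The deck does not include the router code itself; as an illustration only, a Spark Streaming
job of this shape might look like the sketch below. The broker address, topic name, eventType
field, and output bucket are all placeholders, not values from the talk.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object StreamRouter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("stream-router").getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(60))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka:9092",               // placeholder broker address
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "stream-router")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty) {
        val df = spark.read.json(rdd.map(_.value()))      // parse JSON log lines
        // fan out by event type; the batch ETL cluster picks these paths up later
        df.write.mode("append").partitionBy("eventType").json("s3a://bucket/raw/")
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }
}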
OLD BATCH DATA PIPELINE
• The ETL Cluster (HDFS + Hive Metastore) pulls data from the streaming pipeline
• Hard-coded Hive SQL with a predefined schema loads the data: JSON is transformed to
Parquet and loaded into Hive
• Aggregations run as hard-coded SQL; results are copied to the SDM cluster's HDFS and
Hive Metastore via distcp
• Map-Reduce jobs tend to fail
OLD ANALYTICS CLUSTER
• The SDM cluster (HDFS + Hive Metastore), secured with Kerberos, receives data from the
batch cluster via distcp
• BI reporting exports to 3rd-party services; exploratory analytics runs through Hue
(Impala/Hive SQL)
PROS AND CONS OF OLD ARCHITECTURE
PROS
• Isolation of workloads
• Fast ingest
• Loosely coupled clusters
• Secure analytics cluster
CONS
• Multiple copies of data
• Tightly coupled storage and compute
• Lack of elasticity
• Operational overhead of multiple clusters
• Hard-coded batch Hive SQL, not flexible to change
• Multiple Hive metastores
• distcp across clusters can take a long time as data volume grows
PROS AND CONS OF OLD ARCHITECTURE
CONS (continued)
• Not easy to scale
• Storage and compute costs
• Only a SQL interface, no predictive analytics tools
• Not easy to adapt to data schema changes
New Infrastructure
KEY INITIATIVES: INFRASTRUCTURE
• Separate compute and storage
• Move storage to S3
• Centralize Hive metadata (see the configuration sketch below)
• Use ephemeral instances as compute clusters
• Simplify the ETL ingestion process and eliminate distcp
• Elasticity
• Auto-scale the compute cluster (expand & shrink based on demand)
• Enhance analytics capabilities
• Introduce notebooks: Scala, Python, R, etc.
• AWS cost reduction
• Reduce EBS storage cost
• Dynamic DDL
• Add schema on the fly
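As a hypothetical illustration of the first initiative, every ephemeral compute cluster can be
pointed at one shared Hive metastore with its warehouse on S3; the metastore URI and bucket
name below are placeholders, not values from the talk.

val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("ephemeral-etl")
  .config("hive.metastore.uris", "thrift://metastore.internal:9083") // shared metastore (placeholder URI)
  .config("spark.sql.warehouse.dir", "s3a://gopro-warehouse/")       // table data on S3 (placeholder bucket)
  .enableHiveSupport()
  .getOrCreate()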
DATA PLATFORM ARCHITECTURE
• Real Time Cluster (streaming ingestion): log file streaming, Kafka, Spark
• Batch Ingestion Framework (batch ingestion): pre-processing of Rest API, FTP, S3 sync, etc.
• Both pipelines write Parquet + DDL to S3 and state-sync to a centralized Hive Metastore
• Ephemeral compute clusters run OLAP aggregation over the S3 data
• Downstream consumers: notebooks, Plot.ly server, Tableau server, and external services
NEW DYNAMIC DDL ARCHITECTURE
The streaming pipeline (HTTP traffic through an ELB into the pipeline for processing
streaming logs) is transitioning to write directly to S3, backed by a centralized Hive
Metastore.
DATA PLATFORM ARCHITECTURE
The batch pipeline pulls from and exports to 3rd-party services, then runs
ingestion/aggregation/snapshots with dynamic DDL against S3, state-syncing to the
centralized Hive Metastore (transition in progress).
ANALYTICS ARCHITECTURE – IN PROGRESS
• BI reporting/visualization
• Exploratory/predictive analytics: Spark SQL / Scala / Python / R
• Shared Hive Metastore and OLAP aggregation underneath
• DSE self-service portal
Dynamic DDL: Adding
Schema to Data on the fly
WHAT IS DYNAMIC DDL?
• Dynamically alter the table and add columns
Incoming event: {"userId": "123", "eventId": "abc"}
Flattened columns: record_userId, record_eventId
Existing Table X:
A B C
a b c
Updated Table X:
A B C userId eventId
a b c 123 abc
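A minimal Scala sketch of the diff step this example implies; table X and the two string
columns come from the toy event above, everything else is illustrative:

val existing = Set("a", "b", "c")                        // current columns of table X
val incoming = Seq("userId" -> "string", "eventId" -> "string")
val toAdd = incoming.filterNot { case (name, _) => existing.contains(name.toLowerCase) }
if (toAdd.nonEmpty) {
  val ddl = s"ALTER TABLE X ADD COLUMNS (${toAdd.map { case (n, t) => s"$n $t" }.mkString(", ")})"
  println(ddl)  // ALTER TABLE X ADD COLUMNS (userId string, eventId string)
}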
WHY USE DYNAMIC DDL?
• Reduce development time
• Traditionally, adding a new event/attribute/column requires a lot of coordination time
across teams
• Many Hive ETL SQL scripts need to change for every column change
• One way to solve this is a key-value pair table
• Ingestion is easy: no changes needed for a newly added event/attribute/column
• But it is hard for analytics: tabular data is much easier to work with
• Dynamic DDL
• Automatically flattens attributes (for JSON data), as sketched below
• Turns the data into columns
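The flattening step can be sketched as follows; this is illustrative rather than the
production code, and it handles one level of nesting (recurse for deeper structures):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// turn nested attributes into prefixed top-level columns, e.g. record.userId -> record_userId
def flatten(df: DataFrame): DataFrame = {
  val cols = df.schema.fields.flatMap { f =>
    f.dataType match {
      case st: StructType =>
        st.fields.toSeq.map(c => col(s"${f.name}.${c.name}").as(s"${f.name}_${c.name}"))
      case _ => Seq(col(f.name))
    }
  }
  df.select(cols: _*)
}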
DYNAMIC DDL – CREATE TABLE
// manually create table due to Spark bug
def createTable(sqlContext: SQLContext, columns: Seq[(String, String)],
                destInfo: OutputInfo, partitionColumns: Array[(ColumnDef, Column)]): DataFrame = {
  val partitionClause = if (partitionColumns.length == 0) "" else {
    s"""PARTITIONED BY (${partitionColumns.map(f => s"${f._1.name} ${f._1.`type`}").mkString(", ")})"""
  }
  val sqlStmt =
    s"""CREATE TABLE IF NOT EXISTS ${destInfo.tableName()}
       (${columns.map(f => s"${f._1} ${f._2}").mkString(", ")})
       $partitionClause
       STORED AS ${destInfo.destFormat.split('.').last}"""
  // Spark 2.x doesn't fully honor CREATE TABLE IF NOT EXISTS: it still logs an
  // AlreadyExistsException message, but no exception is actually thrown
  sqlContext.sql(sqlStmt)
}
DYNAMIC DDL – ALTER TABLE ADD COLUMNS
// first find the existing fields, then the new fields to add
val tableDf = sqlContext.table(dbTableName)
val existingFields: Seq[StructField] = …
val newFields: Seq[StructField] = …

if (newFields.nonEmpty) {
  // spark 2.x bug https://issues.apache.org/jira/browse/SPARK-19261 -- this ALTER TABLE
  // statement is rejected by Spark 2.0/2.1, hence the workaround on the next slide
  val sqlStmt: String = s"""ALTER TABLE $dbTableName ADD COLUMNS (${newFields.map(f =>
    s"${f.name} ${f.dataType.typeName}").mkString(", ")})"""
  sqlContext.sql(sqlStmt)
}
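The elided definitions above are presumably a schema diff; one hypothetical fill-in, where
incomingDf is assumed to be the already-flattened incoming DataFrame:

val existingNames = tableDf.schema.fieldNames.map(_.toLowerCase).toSet  // Hive lowercases names
val existingFields: Seq[StructField] =
  incomingDf.schema.filter(f => existingNames.contains(f.name.toLowerCase))
val newFields: Seq[StructField] =
  incomingDf.schema.filterNot(f => existingNames.contains(f.name.toLowerCase))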
DYNAMIC DDL – ALTER TABLE ADD COLUMNS (SPARK 2.0)
// Hack for Spark 2.0 / Spark 2.1
if (newFields.nonEmpty) {
  // spark 2.x bug https://issues.apache.org/jira/browse/SPARK-19261
  alterTable(sqlContext, dbTableName, newFields)
}

def alterTable(sqlContext: SQLContext,
               tableName: String,
               newColumns: Seq[StructField]): Unit = {
  alterTable(sqlContext, getTableIdentifier(tableName), newColumns)
}

// the workaround reaches into Spark's internal Hive catalog:
private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configuration)
  extends ExternalCatalog with Logging {
  ….
}
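A hypothetical sketch of what such a workaround can look like: skip the SQL parser and update
the table definition through Spark's ExternalCatalog directly. These are internal APIs whose
accessibility varies by Spark version (hence the code above living in a private[spark] class),
so treat this as illustrative only, not a supported recipe:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructField, StructType}

def alterTableViaCatalog(spark: SparkSession, db: String, table: String,
                         newColumns: Seq[StructField]): Unit = {
  val catalog = spark.sharedState.externalCatalog    // internal, version-dependent API
  val current = catalog.getTable(db, table)          // CatalogTable for the table
  val updated = current.copy(schema = StructType(current.schema.fields ++ newColumns))
  catalog.alterTable(updated)                        // persists the new schema to the metastore
}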
DYNAMIC DDL – PREPARE DATAFRAME
// Reorder the columns in the incoming data frame to match the order in
// the destination table. Project all columns from the table.
modifiedDF = modifiedDF.select(tableDf.schema.fieldNames.map(
  f => {
    if (modifiedDF.columns.contains(f)) col(f) else lit(null).as(f)
  }): _*)

// Coalesce the data frame into the desired number of partitions (files)
// to avoid creating too many small files
modifiedDF.coalesce(ioInfo.outputInfo.numberOfPartition)
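The slide stops before the actual write; a hypothetical final step is shown below, with
dbTableName as on the earlier slides. insertInto matches columns by position, which is why
the reordering above matters:

val outputDF = modifiedDF.coalesce(ioInfo.outputInfo.numberOfPartition)
outputDF.write.mode("append").insertInto(dbTableName)  // append into the Hive table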
DYNAMIC DDL – BATCH SPECIFIC ISSUES
• Issue 1: Several log files are mapped into the same table, and not all columns are present
in every file (e.g. CSV file 1 carries columns A, B while CSV file 2 carries A, X, Y, yet the
table writer must produce consistent rows for the one destination table)
DYNAMIC DDL – BATCH SPECIFIC ISSUES
• Solution: find the DataFrame with the max number of columns, use it as the base, and
reorder the columns of every DataFrame against it

val newDfs: Option[ParSeq[DataFrame]] = maxLengthDF.map { baseDf =>
  dfs.map { df =>
    df.select(baseDf.schema.fieldNames.map(f =>
      if (df.columns.contains(f)) col(f) else lit(null).as(f)): _*)
  }
}
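maxLengthDF is not defined on the slide; a plausible definition, assuming dfs is the
collection of per-file DataFrames:

// the widest DataFrame, if any, becomes the alignment base (None for an empty input)
val maxLengthDF: Option[DataFrame] =
  if (dfs.isEmpty) None else Some(dfs.maxBy(_.columns.length))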
DYNAMIC DDL – BATCH SPECIFIC ISSUES
• Issue 2: Too many log files hurts performance
• Solution: consolidate the per-file DataFrames into chunks, and union all the DataFrames
within each chunk before saving

val ys: Seq[Seq[DataFrame]] = destTableDFs.seq.grouped(mergeChunkSize).toSeq
val dfs: ParSeq[DataFrame] = ys.par.map(p => p.foldLeft(emptyDF) { (z, a) => z.unionAll(a) })
dfs.foreach(saveDataFrame(info, _))
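Chunking also caps how many unions feed any single save. The emptyDF seed used in the fold is
not defined on the slide; a plausible construction, where destTableSchema is an assumed
reference to the destination table's schema:

import org.apache.spark.sql.Row

// an empty DataFrame sharing the destination schema, so unionAll (positional) lines up
val emptyDF: DataFrame = sqlContext.createDataFrame(
  sqlContext.sparkContext.emptyRDD[Row], destTableSchema)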
SUMMARY
•GoPro's Data Platform is in transition, and we have just gotten started
•Central Hive Metastore + S3 → separated storage and compute, reduced cost
•Introducing cloud computing for elasticity and reduced operational complexity
•Leveraging dynamic DDL for flexible ingestion, aggregation, and snapshots, for both batch
and streaming
Questions?
Editor's Notes

#7 (Growing data needs from GoPro ecosystem):
• Variety of data: software (mobile, desktop, and cloud apps); hardware (camera, drone, drone
controller, VR, accessories, developer program); 3rd-party data (CRM, social media, OTT,
e-commerce, etc.)
• Variety of ingestion mechanisms: a real-time streaming pipeline, plus a batch pipeline for
pushed or pulled data
• Complex data transformation: data is often stored as binary to conserve space in the camera;
special logic for pairing events and flight-time correction; heterogeneous data formats (JSON,
CSV, binary)
• Seamless data aggregation: combine data from both hardware and software; build data
structures that are both event-based and state-based

#8 (Data platform challenges):
• Scalability: increasing number of data sources and service requests; quick visibility of the
data; infrastructure scalability
• Data quality: tools and infrastructure for the QA process
• Hadoop DevOps: managing Hadoop hardware (disk, security, services, dev and staging clusters)
• Monitoring: data pipeline monitoring metrics and infrastructure
• Enabling predictive analytics: tools for machine learning and exploratory analytics
• Cost management: AWS (storage & compute) as well as license costs