The Little Warehouse That Couldn’t Or: How We Learned to Stop Worrying and Move to Spark
1
Yandu Oppacher (@yandu)
Data Infrastructure
2
Shopify Stores
ETL Warehouse Reporting
August 2013
Tilller Ruby Vertica
3
Why we had to move
• Data volume
• Data/Query complexity
• Performance issues
4
Couple of false starts
5
Pig + Luigi
Pig + Oozie
Platfora
“Without coding or ETL, data warehousing, BI tools, or breaking a sweat.”
–platfora.com
6
Enter Spark
• Fast
• Nice development model
• Python
7
8
The Good Book
GMV
A Case Study
9
165,000+
ACTIVE SHOPIFY MERCHANTS
$8 BILLION+
CUMULATIVE GMV
Growing pains
• Joins
• Groupings
• General data skew
• Getting to know Python’s performance quirks
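One common way to soften skewed groupings is to "salt" hot keys so that a single key's work is split into several partial aggregates before a final combine. The slides do not say which technique was used, so the following is only a minimal sketch of the general idea in plain Python (no Spark required); `salted_sum`, `num_salts`, and the sample `orders` data are hypothetical names chosen for illustration.

```python
import random
from collections import defaultdict


def salted_sum(records, num_salts=4, seed=0):
    """Sum values per key in two stages, spreading hot keys over salted sub-keys."""
    rng = random.Random(seed)

    # Stage 1: partial sums keyed by (key, salt). A hot key is split across
    # up to num_salts buckets, so no single bucket receives all of its rows.
    partial = defaultdict(int)
    for key, value in records:
        partial[(key, rng.randrange(num_salts))] += value

    # Stage 2: drop the salt and combine the partial sums per original key.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal
    return dict(totals)


# Hypothetical skewed input: one shop dominates the row count.
orders = [("big_shop", 10)] * 1000 + [("small_shop", 5)] * 3
totals = salted_sum(orders)
```

In a distributed engine the two stages would correspond to two shuffles: the first one is balanced because of the salt, and the second one only moves a handful of partial results per key.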
10
Starscream
11
• specialized joins
• resolvers
• range
• cassandra
• overby
• contracts
• incrementalized fact builds
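The slides only name these features, so the sketch below is not Starscream's API. Assuming "range" refers to joining each record against a set of non-overlapping intervals, one way such a join could look in plain Python is a sorted lookup with `bisect`; `range_join`, `plan_periods`, and `events` are hypothetical names and data.

```python
from bisect import bisect_right


def range_join(events, ranges):
    """Attach to each (key, payload) event the label of the half-open
    range [start, end) that contains key, or None if no range matches.

    Assumes the ranges do not overlap.
    """
    ordered = sorted(ranges)  # sort by range start
    starts = [start for start, _end, _label in ordered]

    joined = []
    for key, payload in events:
        # Index of the last range whose start is <= key.
        i = bisect_right(starts, key) - 1
        if i >= 0 and key < ordered[i][1]:
            joined.append((key, payload, ordered[i][2]))
        else:
            joined.append((key, payload, None))
    return joined


# Hypothetical data: day offsets mapped to a plan that was active then.
plan_periods = [(0, 10, "trial"), (10, 50, "basic"), (50, 100, "pro")]
events = [(3, "order-a"), (10, "order-b"), (75, "order-c"), (120, "order-d")]
joined = range_join(events, plan_periods)
```

Sorting the ranges once and binary-searching per event avoids comparing every event against every range, which is the cost a naive cross-product join would pay.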
Our current stack
12
Kafka
OLTP
HDFS
Cassandra
Spark
Frontroom Backroom
Redshift
Tableau
Thank you
13
Yandu Oppacher (@yandu)
Data Infrastructure
