1
Snowplow and
Cascalog
METAIL - YOUR ONLINE FITTING ROOM
Presentation by Rob Boland, Lead Data Architect
2
Introduction
• Introduction to Metail – who we are, why we use Snowplow
• How the Lambda Architecture has influenced our Data
Architecture
• Where Cascalog fits in at Metail and why it works well with
Snowplow
• Example of where we’ve used Cascalog and how it works
• Looker forward to the future
3
Every body is unique and
should be celebrated
4
YOUR ONLINE FITTING ROOM
5
• Sign up with just a few clicks
• See how the clothes look on you
• Build layered outfits
• Get size recommendation
http://trymetail.com/collections/metail
6
Product portfolio: Data services
1. Customer shape & size data can now aid brands’ buying & selling decisions
2. Body shape & outfitting data -> crowd-sourced outfit recommendations
UNDERSTANDING SHAPE PROFILE OF CUSTOMERS
Do we need to create new collections to cater for clusters of different shapes?
HOW SHAPE VARIES BY SIZE
Do we need to change the fit profile by size to accommodate different shapes?
7
KPI Analysis –
Can we prove it actually works?
Metric | Definition
Return on Investment | [(VPVuplift * All Visits) - Investment] / Investment
Net Sales Revenue | Value of retained items in bin
Value per Visitor (VPV) | Net Sales Revenue / Visitors
Visits (sessions) | Set of activities with <= 30 minutes between consecutive events
User Conversion | Orders / Visitors
Adoption Rate | Number of users who use Metail / Number of users shown Metail
Average Order Value | Median value of all orders tracked in the time period
Return Rate | Number of items returned / Number of items purchased
Average Retained Order Value | Median value of all orders tracked in the time period after removing returned items
AB Set up: 50/50 split test
Managed by: Metail through their AB test platform
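The ratio definitions above translate almost directly into code. The following is a minimal Python sketch, not Metail's implementation: the function names are made up for illustration, and VPVuplift is assumed to mean the difference in value per visitor between the two test groups, which the slide does not spell out.

```python
from statistics import median


def value_per_visitor(net_sales_revenue, visitors):
    # Value per visitor = Net Sales Revenue / Visitors
    return net_sales_revenue / visitors


def user_conversion(orders, visitors):
    # User conversion = Orders / Visitors
    return orders / visitors


def adoption_rate(users_who_used_metail, users_shown_metail):
    # Adoption rate = users who use Metail / users shown Metail
    return users_who_used_metail / users_shown_metail


def return_rate(items_returned, items_purchased):
    # Return rate = items returned / items purchased
    return items_returned / items_purchased


def average_order_value(order_values):
    # The slide defines "average" order value as the median of tracked orders
    return median(order_values)


def return_on_investment(vpv_uplift, all_visits, investment):
    # ROI = [(VPVuplift * All Visits) - Investment] / Investment
    # ASSUMPTION: vpv_uplift is VPV(active group) - VPV(control group)
    return (vpv_uplift * all_visits - investment) / investment
```

All inputs are plain counts or currency totals for one test period; the hard part, as the following slides show, is getting those inputs out of the event stream.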
8
KPI Analysis –
Can we prove Metail impact?
Data Collection
We need to know visitor counts, order values, which test group the
user was in, whether they actually used Metail, time on site,
what garments they wore, etc.
9
Enter Snowplow
10
What Metail looks like (for now…)
11
Data Collection! Now what?
Read the Big Data book
(Still MEAP after 3 years!)
12
Lambda Architecture
13
Cascalog to produce Batch Views
Turn the Snowplow event stream into a normalised schema
Body Shape
Orders
Items Ordered
Returns
Browsers
(visitors)
Sessions
Garment Details
AB Events
Snowplow
Events
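Sessions, one of the entities above, are defined on the KPI slide as a set of activities with at most 30 minutes between consecutive events. A minimal Python sketch of that rule follows. It is a plain-Python stand-in rather than the Cascalog job used at Metail, and the function name and input shape (a list of event timestamps for a single visitor) are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Maximum gap between consecutive events within one session (from the KPI slide)
SESSION_GAP = timedelta(minutes=30)


def split_into_sessions(timestamps):
    """Group one visitor's event timestamps into sessions."""
    sessions = []
    for ts in sorted(timestamps):
        # Extend the current session if the gap to its last event is <= 30 minutes
        if sessions and ts - sessions[-1][-1] <= SESSION_GAP:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Hypothetical events for one visitor: three close together, one after a long gap
example_events = [
    datetime(2014, 1, 1, 10, 0),
    datetime(2014, 1, 1, 10, 20),
    datetime(2014, 1, 1, 10, 50),
    datetime(2014, 1, 1, 11, 30),
]
example_sessions = split_into_sessions(example_events)
```

In a batch job the same logic would run once per visitor, after grouping the event stream by visitor id.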
14
Cascalog:
Snowplow ETL Runner Output -> Batch Views
Cascalog is designed to process Big Data on top of Hadoop. It is a
replacement for tools like Pig, Hive, and Cascading, and operates at a
significantly higher level of abstraction than those tools [1]
Write Clojure code to create our data processing jobs
• The code you write has to be MapReduce-aware, but the low-level
implementation details are taken care of
• What we’re really doing is adding another ETL step to the Snowplow flow
[1] http://cascalog.org/
Cascalog is written in Clojure (JCascalog is its Java API; Scalding is a comparable Scala library built on Cascading)
It’s easy to run on Amazon EMR – fits in with the Snowplow flow nicely
15
Cascalog – Worth the effort?
Couldn’t you achieve the same output working with the
events table alone?
…kind of
But there are two key benefits:
1. Breaking the data into a manageable schema means you can
directly access the data you care about
2. Complex logic and aggregation are easier to achieve
Real example:
• KPI Data Aggregation
16
Cascalog – KPI Data Aggregation
Metric | Definition
Value per visitor | Net Sales Revenue / Visitors
User Conversion | Orders / Visitors
Adoption Rate | Number of users who use Metail / Number of users shown Metail
How do we calculate KPIs from our Snowplow data?
In both the Active and Control groups, we need:
• Visitor Count
• Engaged Visitor Count
• Order Count
• Order Value
17
Cascalog – KPI Data Aggregation
Visitor Count
• Snowplow tracks visitors – our code just has to look up visitors who
are in the test we’re measuring
Engaged Visitor Count
• We fire a structured event to Snowplow each time an ‘engagement’ event
occurs. For each visitor in the test, our code has to find whether or
not they engaged with Metail
Orders
• We encode all of the relevant order information on the page in JSON and
fire an unstructured event with the details
Order Count
• Our code needs to find all of the order events in the time period
Order Value
• Our code needs to read the value of each order and sum the values
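The four counts described above can be sketched as a single pass over the events. This is an illustrative Python stand-in, not the Cascalog code: the event dictionaries and their keys (`type`, `domain_id`, `ab_group`, `order_value`) are hypothetical simplifications of the real Snowplow structured and unstructured events.

```python
def aggregate_kpi_inputs(events):
    """Return visitor, engaged, order and order-value totals per test group."""
    group_of = {}    # domain_id -> test group
    engaged = set()  # domain_ids that fired an engagement event
    orders = []      # (domain_id, order_value) pairs

    for e in events:
        if e["type"] == "ab":
            group_of[e["domain_id"]] = e["ab_group"]
        elif e["type"] == "engagement":
            engaged.add(e["domain_id"])
        elif e["type"] == "order":
            orders.append((e["domain_id"], e["order_value"]))

    totals = {}
    for user, group in group_of.items():
        t = totals.setdefault(
            group, {"visitors": 0, "engaged": 0, "orders": 0, "order_value": 0.0}
        )
        t["visitors"] += 1
        if user in engaged:
            t["engaged"] += 1

    for user, value in orders:
        group = group_of.get(user)
        if group is None:
            continue  # order from a visitor who is not in the test
        totals[group]["orders"] += 1
        totals[group]["order_value"] += value

    return totals
```

The result holds only group-level totals, so there is nothing left to drill into afterwards, which is the limitation the next slide addresses.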
18
Cascalog – KPI Data Aggregation
We can do better!
What we really want is a user level summary of the data
domain_id engaged order_value order_id ab_group
0014822757d9a81f null 175.89 89281949 out
0015ca5144f0fae7 null null null out
0015dd8901887010 null 310.22 25394849 out
0015e633aa2c158d null null null in
00204e1bcc87b734 null null null out
0042472794f2b57a null 191.98 89392136 in
004389f95e620dd0 null null null out
0044867c3d7b1cf5 null null null out
00456d1e9300296e null null null out
0045dc05b4262ed2 null null null in
0045f74358a842c1 TRUE null null in
00462b685f4188ad null null null out
0048fccbe230dc57 null null null out
0049a5d24498051d TRUE 101.96 27529849 in
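With the data in this shape, the KPIs fall out of a simple group-by. The Python sketch below uses the sample rows from the table above and rests on a few stated assumptions: one row per user, `order_value` treated as revenue (returns are not removed here, so this is not yet net sales revenue), and adoption measured as engaged users over all users in a group. The function name is illustrative.

```python
# Sample rows from the slide: (domain_id, engaged, order_value, order_id, ab_group)
ROWS = [
    ("0014822757d9a81f", None, 175.89, "89281949", "out"),
    ("0015ca5144f0fae7", None, None, None, "out"),
    ("0015dd8901887010", None, 310.22, "25394849", "out"),
    ("0015e633aa2c158d", None, None, None, "in"),
    ("00204e1bcc87b734", None, None, None, "out"),
    ("0042472794f2b57a", None, 191.98, "89392136", "in"),
    ("004389f95e620dd0", None, None, None, "out"),
    ("0044867c3d7b1cf5", None, None, None, "out"),
    ("00456d1e9300296e", None, None, None, "out"),
    ("0045dc05b4262ed2", None, None, None, "in"),
    ("0045f74358a842c1", True, None, None, "in"),
    ("00462b685f4188ad", None, None, None, "out"),
    ("0048fccbe230dc57", None, None, None, "out"),
    ("0049a5d24498051d", True, 101.96, "27529849", "in"),
]


def kpis_by_group(rows):
    """Compute per-group KPIs from user-level summary rows."""
    totals = {}
    for _domain_id, engaged, order_value, _order_id, group in rows:
        t = totals.setdefault(
            group, {"visitors": 0, "engaged": 0, "orders": 0, "revenue": 0.0}
        )
        t["visitors"] += 1  # ASSUMPTION: one row per user
        if engaged:
            t["engaged"] += 1
        if order_value is not None:
            t["orders"] += 1
            t["revenue"] += order_value
    return {
        group: {
            "value_per_visitor": t["revenue"] / t["visitors"],
            "user_conversion": t["orders"] / t["visitors"],
            "adoption_rate": t["engaged"] / t["visitors"],
        }
        for group, t in totals.items()
    }


kpis = kpis_by_group(ROWS)
```

Because every row still carries its `domain_id` and `order_id`, any figure in `kpis` can be traced back to the users and orders behind it.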
19
Cascalog – Implementation
1) Read in the Snowplow events data from HDFS
2) Remove events we don’t care about
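The original slide showed the Clojure code as an image, which is not reproduced here. As a rough stand-in, step 2 can be sketched in Python using the rules from the speaker notes: drop events with invalid IP addresses, keep only structured and unstructured events, and map page URLs to retailers. The field names `event`, `user_ipaddress` and `page_url` follow the Snowplow canonical event model; the retailer lookup table, the `retailer` output field, and the reading of "invalid" as "does not parse as an IP address" are assumptions.

```python
import ipaddress
from urllib.parse import urlparse

KEPT_EVENT_TYPES = {"struct", "unstruct"}

# HYPOTHETICAL lookup from page host to retailer name
RETAILER_BY_HOST = {"shop.example.com": "example-retailer"}


def has_valid_ip(event):
    # ASSUMPTION: "invalid" means the value does not parse as an IPv4/IPv6 address
    try:
        ipaddress.ip_address(event.get("user_ipaddress") or "")
        return True
    except ValueError:
        return False


def clean_events(events):
    """Yield only the events we care about, tagged with their retailer."""
    for event in events:
        if event.get("event") not in KEPT_EVENT_TYPES:
            continue
        if not has_valid_ip(event):
            continue
        host = urlparse(event.get("page_url") or "").hostname
        yield {**event, "retailer": RETAILER_BY_HOST.get(host)}
```

Each event is handled independently of the others, which is what lets a filter like this be split across a cluster.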
20
Cascalog – Implementation
3) Take those events, pull out the bits we care about and join them together
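Conceptually, step 3 is an outer join from the AB test events to the engagement and order events, keyed on the visitor id. The Python sketch below mimics that join to produce rows shaped like the user-level summary table. It is not the Cascalog query from the slide; the input shapes are hypothetical, and it assumes one AB event per visitor.

```python
def user_level_summary(ab_events, engagement_events, order_events):
    """Join AB, engagement and order events into one row per user (and order)."""
    engaged_users = {e["domain_id"] for e in engagement_events}

    orders_by_user = {}
    for order in order_events:
        orders_by_user.setdefault(order["domain_id"], []).append(order)

    rows = []
    for ab in ab_events:  # ASSUMPTION: one AB event per visitor
        user = ab["domain_id"]
        # Users without orders still get a row, with null order fields
        for order in orders_by_user.get(user) or [None]:
            rows.append({
                "domain_id": user,
                "engaged": True if user in engaged_users else None,
                "order_value": order["order_value"] if order else None,
                "order_id": order["order_id"] if order else None,
                "ab_group": ab["ab_group"],
            })
    return rows
```

A visitor with several orders would appear once per order under this sketch; whether the real batch view does the same is not stated in the deck.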
21
What do we do with the Batch Views?
Take the output and crunch it in R (or Incanter)
A lot of the subsequent analysis we run on our batch views requires
statistical packages, so we run our advanced analysis in R.
Thankfully, having the batch views ready has led to far fewer of these:
22
A Looker Ahead
Not everyone can write Cascalog and R.
Looker will open our batch views and Snowplow events to
our Business Analysts
23
www.metail.com
Contact information
ROB BOLAND
LEAD DATA ARCHITECT
rob@metail.com
Skype: rpboland

Editor's Notes

  • #4: Fashion technology start-up company, focused on delivering the best UX for browsing and buying clothes online. How? By recognising every body is unique and should be celebrated! When looking at clothes online, why are we restricted to only seeing how they look on models or mannequins? Why not on our own bodies? That is the question we are solving through 2 core technologies:
      • Body visualisation – having a quick and easy way to create your body model online, your MeModel
      • Garment fit – low-cost and quick method for digitising clothes
    The results? Well, you can see for yourself from this slide, which shows a collection of MeModels we have created, wearing different clothes
  • #7: I’m not going to spend too much time on this slide, but I wanted to give an overview of the kind of data services we provide for our retailers, which we put together from the data we collect
  • #9: GA just doesn’t give us the level of detail we require. It has its uses, and provides great overviews and visualisations, but drilling into the detail of what a user actually did gets a bit clunky. Funnel analysis never quite cut it for us, especially when it comes to measuring KPIs and billing, where it’s really important that the numbers are accurate
  • #11: Key points to note: we are adding two trackers here, one that sits on the retailer’s site and one that sits on our widget. Because we have the tracker on the retailer’s pages, we get a lot more data than a startup of our size might expect. We track everything, send a _lot_ of structured events (fell out of GA), and also use unstructured events where we’ve needed to pass more data. We actually started our Snowplow collection before we really knew what to do with it. No harm getting the tracker on early
  • #12: MEAP for a mere three years – hopefully Unified Log Processing comes more quickly…
  • #13: Computing arbitrary functions on arbitrary data
      • Batch layer – stores the master dataset and computes arbitrary views
      • Serving layer – indexes the batch view and loads it up so it can be efficiently queried to get particular values out of the view. The serving layer is a specialized distributed database that loads in batch views, makes them queryable, and continuously swaps in new versions of a batch view as they're computed by the batch layer
      • Speed layer – takes the data and updates it based on what it knows, discards data as it’s no longer needed
    Properties: robust and fault tolerant; scalable; general; extensible; allows ad hoc queries; minimal maintenance; debuggable
  • #14: Entities we care about
  • #15: Batch computations are written like single-threaded programs, yet automatically parallelize across a cluster of machines. This implicit parallelization makes batch layer computations scale to datasets of any size. It's easy to write robust, highly scalable computations on the batch layer. Scale
  • #17: Remember our KPI slide – I’ve picked out a couple of these and I’m going to talk about how we use Snowplow to capture this data
  • #18: All of these things would be fairly easy to pull out of the processed Snowplow data, even if it’s large. Redshift is good at running these kinds of queries, and combining the numbers returned is not difficult. The problem comes if you present this back to the retailer or your users: there are always follow-up questions, and it’s difficult to drill down on this kind of summary data:
      • What kind of items do the users who engaged try on vs what they purchased?
      • Can you tell me which users
      • What days were there the most orders?
      • Can you provide the order_ids so we could check the values our end?
  • #19: This is better because we now have the Snowplow domain_id. It’s a summary view showing us, for any specific user in the test, which group they were in, whether they clicked on the Metail button, and whether they made an order and, if so, for how much. Tying everything back to the user is a great advantage, because any subsequent analysis is much easier to carry out. We join back to the Snowplow events on domain_id, e.g. for users who engaged: what did they try on? This data has just run in a batch, so it is ready and waiting for us to start analysis on, and doesn’t need to be recomputed over and over again. It’s also easy to calculate the KPIs I mentioned and, because we have everything on a per-user level, we can perform statistical bootstrapping to look at the distributions and work out error bars on the results
  • #20: I know many of you will never have seen Clojure before and I don’t intend to spend time going through every line, but I wanted to show you that what we’re doing is conceptually very simple. With a few lines of code we’ve cleared a huge amount of data we don’t need:
      • Chuck invalid IP addresses
      • Chuck anything that’s not a Struct or an Unstruct event
    And we’ve started to transform it: page URLs become retailers
  • #21: Cascalog takes care of all of the nitty gritty – and running it on Amazon EMR means we can power it up as we’d like, because we’re leveraging MapReduce. With MapReduce it doesn’t matter how big your Snowplow logs are: you can split the data arbitrarily and run Cascalog over it. Every row can
  • #22: At the moment
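The note on the user-level summary mentions statistical bootstrapping to put error bars on the KPIs. The deck says this analysis is done in R; the sketch below is a plain-Python illustration of the idea for value per visitor, using a percentile bootstrap. The function name, the resample count and the fixed seed are assumptions, and the sample values are the five "in" group users from the summary table, with 0.0 for users who did not order.

```python
import random


def bootstrap_vpv_interval(values_per_user, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean value per visitor."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values_per_user)
    means = []
    for _ in range(n_resamples):
        # Resample users with replacement and recompute the KPI
        sample = rng.choices(values_per_user, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Order value per user for the "in" group rows shown on the summary slide
in_group_values = [0.0, 191.98, 0.0, 0.0, 101.96]
vpv_lower, vpv_upper = bootstrap_vpv_interval(in_group_values)
```

With only five users the interval is very wide; the point is that per-user rows make this kind of resampling possible at all, which group-level totals do not.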