Building a Data Warehouse on AWS
Amazon S3
Amazon Redshift
Collect, Process, Analyze, Store
Data -> Answers
Visualize
@Lynn Langit
AWS Marketplace
Enterprise software store for business users who need simplified procurement
• 2000+ product listings to browse, test and buy software
• 1-click deployment to launch, in multiple regions around the world
• Pay-as-you-go pricing to use on demand
Advanced Analytics
Data Enablement
Business Intelligence
Building a Data Warehouse on AWS
Move data into Redshift from S3 for analysis
Amazon S3
Amazon Redshift
AWS Marketplace Partners
Matillion
Visualize: Yellowfin
Collect, Process, Analyze, Store
Data -> Answers
Setup
Our Scenario and Source Files
File Types
-- Text - .csv
-- Compressed - .gz
File Categories
Details / Events
-- Flights
-- Weather
Metadata
-- Airports
-- Carriers
“In this scenario we will use Matillion ETL
for Redshift to prepare two separate data
sources ready for analysis.
The sample data is US airport flight
information from 1995 -> 2008. Every flight
to or from a US airport (and whether it left
on time or not) is included.
The second data set is weather data, taken
from NOAA, including the daily weather
readings for each US Airport.”
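A minimal sketch of reading the two file types listed above (plain .csv and compressed .gz) with the Python standard library. The column names and sample values are hypothetical and are not taken from the actual flight data set.

```python
import csv
import gzip
import io


def read_rows(raw: bytes, compressed: bool):
    """Parse CSV bytes into dicts, decompressing first for .gz input."""
    data = gzip.decompress(raw) if compressed else raw
    return list(csv.DictReader(io.StringIO(data.decode("utf-8"))))


# Hypothetical sample; the real files have their own columns.
sample = b"Year,Carrier,DepDelay\n2008,WN,8\n2008,AA,-2\n"

plain_rows = read_rows(sample, compressed=False)
gz_rows = read_rows(gzip.compress(sample), compressed=True)
```

Both calls yield the same rows, so a loader only needs to know whether a given file is compressed.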
Loading data from S3 into Redshift
Using Matillion ETL for Redshift
• Create Instance (AMI/EC2) of Matillion/AWS Marketplace
• Connect Matillion to Redshift
Loading Data in Redshift
Table distribution styles
• Distribution Key: same key to same location (key1, key2, key3, key4)
• All: all data on every node
• Even: round robin distribution
[Diagram: each style shown on Node 1 (Slices 1 and 2) and Node 2 (Slices 3 and 4)]
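The three distribution styles can be illustrated with a toy model in Python. This is a sketch, not Redshift's implementation: the 2-node, 4-slice layout follows the diagram, CRC32 stands in for whatever hash Redshift really uses, and the flight rows are made up.

```python
import zlib
from collections import defaultdict

NUM_SLICES = 4  # two nodes with two slices each, as in the diagram
NUM_NODES = 2


def distribute_even(rows, num_slices=NUM_SLICES):
    """EVEN: hand rows to slices in round-robin order."""
    placement = defaultdict(list)
    for i, row in enumerate(rows):
        placement[i % num_slices].append(row)
    return dict(placement)


def distribute_key(rows, key, num_slices=NUM_SLICES):
    """KEY: a hash of the distribution column picks the slice."""
    placement = defaultdict(list)
    for row in rows:
        # CRC32 is an assumption used only to get a stable hash.
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        placement[h % num_slices].append(row)
    return dict(placement)


def distribute_all(rows, num_nodes=NUM_NODES):
    """ALL: every node keeps a full copy of the table."""
    return {node: list(rows) for node in range(num_nodes)}


# Hypothetical rows: eight flights across four carriers.
rows = [{"carrier": c, "flight": n}
        for n, c in enumerate(["AA", "WN", "DL", "UA"] * 2)]

even = distribute_even(rows)
by_key = distribute_key(rows, "carrier")
everywhere = distribute_all(rows)
```

With KEY distribution, rows sharing a carrier always land on the same slice, which is what lets joins on that column avoid moving data; the price is possible skew when a few key values dominate.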
Sort Keys
• Single Column - [ SORTKEY ( date ) ]
  • Queries that use the 1st column (i.e. date) as primary filter
• Compound - [ COMPOUND SORTKEY ( date, region, country ) ]
  • Queries that use the 1st column as primary filter, then other columns
• Interleaved - [ INTERLEAVED SORTKEY ( date, region, country ) ]
  • Queries that use different columns in filter
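A simplified Python model of why a compound sort key favors filters on its first column. It assumes the engine keeps per-block min/max values and skips blocks whose range cannot match; the block size, column values, and function names are hypothetical, and real blocks hold far more rows.

```python
from itertools import product

BLOCK_SIZE = 4  # rows per block; an assumption for illustration


def build_blocks(rows, sort_columns, block_size=BLOCK_SIZE):
    """Sort rows by the compound key, then cut them into fixed-size blocks."""
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in sort_columns))
    return [ordered[i:i + block_size]
            for i in range(0, len(ordered), block_size)]


def blocks_scanned(blocks, column, value):
    """Count blocks whose min/max range for `column` could contain `value`."""
    count = 0
    for block in blocks:
        values = [r[column] for r in block]
        if min(values) <= value <= max(values):
            count += 1
    return count


dates = ["2008-01-01", "2008-01-02", "2008-01-03", "2008-01-04"]
regions = ["east", "north", "south", "west"]
rows = [{"date": d, "region": g} for d, g in product(dates, regions)]

# Compound key (date, region): date is the leading column.
blocks = build_blocks(rows, ["date", "region"])
date_scan = blocks_scanned(blocks, "date", "2008-01-02")    # 1 of 4 blocks
region_scan = blocks_scanned(blocks, "region", "south")     # all 4 blocks
```

Filtering on the leading column touches one block, while filtering only on the second column touches every block. That gap is the case interleaved sort keys are meant to narrow.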
Time Series Data – Vacuum Operation
[Diagram: a table made up of a Sorted Region and an Unsorted Region]
• Append in Sort Key Order
• Sort Unsorted Region
• Merge
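The sort-then-merge behavior of a vacuum can be sketched with a toy table in Python. This is a simplified model under stated assumptions, not Redshift internals: the `Table` class and its rules for when an append counts as "in sort key order" are hypothetical.

```python
import heapq


class Table:
    """Toy table with a sorted region and an unsorted region."""

    def __init__(self):
        self.sorted_region = []
        self.unsorted_region = []

    def append(self, batch):
        batch = sorted(batch)  # each batch arrives sorted within itself
        if not batch:
            return
        in_order = (not self.sorted_region
                    or batch[0] >= self.sorted_region[-1])
        if in_order and not self.unsorted_region:
            # Appending in sort key order keeps the table fully sorted.
            self.sorted_region.extend(batch)
        else:
            self.unsorted_region.extend(batch)

    def vacuum(self):
        # Step 1: sort the unsorted region. Step 2: merge it into the sorted one.
        self.unsorted_region.sort()
        self.sorted_region = list(
            heapq.merge(self.sorted_region, self.unsorted_region))
        self.unsorted_region = []


table = Table()
table.append(["1995-01-01", "1996-01-01"])
table.append(["1997-01-01", "1998-01-01"])  # in order: nothing to vacuum
table.append(["1996-06-01"])                # late row: goes to unsorted region
pending = list(table.unsorted_region)
table.vacuum()
```

Time series loads that always arrive in date order never populate the unsorted region in this model, which mirrors the point of the slide: append in sort key order and the expensive merge is avoided.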
Visualizing with Yellowfin
Automate – https://github.com/lynnlangit/AWSDataWarehouse

Building a data warehouse with AWS Redshift, Matillion and Yellowfin

Editor's Notes

  • #2:
    • Collect logs in an Amazon Kinesis stream
    • Launch Amazon EMR and Amazon Redshift clusters
    • Use Hive on Amazon EMR to access data in an Amazon Kinesis stream
    • Use Hive on Amazon EMR to transform, partition and output data to Amazon S3
    • Load data in parallel into Amazon Redshift from Amazon S3
    • Bonus: use Hive and Amazon DynamoDB to enable Amazon Kinesis "checkpointing"
  • #3: Big Data software on AWS Marketplace: http://amzn.to/1va4KQ6
  • #6: Public data from -- s3://demo-data-sets-west/airline/data/
  • #7: http://docs.aws.amazon.com/general/latest/gr/rande.html and http://docs.aws.amazon.com/redshift/latest/dg/r_STV_SLICES.html
  • #10: Redshift is a distributed system:
    • A cluster contains a leader node and compute nodes
    • A compute node contains slices (one per core) that contain data
    • Data is distributed among slices in 3 ways:
      • Even: rows distributed in round robin fashion (default)
      • Key: rows distributed based on a distribution key (hash of a defined column)
      • All: a full copy of the table is stored on every node
    • Queries run on all slices in parallel
    • Optimal query throughput can be achieved when data is evenly spread across slices
  • #12: When you append data, it's appended to the unsorted region in sorted order. When you vacuum, the unsorted region is sorted first, then merged into the sorted regions. This can be really expensive. If you append data only in the order of your sortkeys, you'll never have to vacuum. Mycroft does this automatically.