Building a data warehouse with AWS Redshift, Matillion and Yellowfin

Building a Data Warehouse on AWS
Services: Amazon S3, Amazon Redshift
Stages: Collect, Store, Process, Analyze, Visualize
Data → Answers
@Lynn Langit

AWS Marketplace
Enterprise software store for business users who need simplified procurement
• 2000+ product listings to browse, test and buy software
• 1-click deployment to launch in multiple regions around the world
• Pay-as-you-go pricing to use on demand
Categories: Advanced Analytics, Data Enablement, Business Intelligence

Building a Data Warehouse on AWS
Move data into Redshift from S3 for analysis
Services: Amazon S3, Amazon Redshift
AWS Marketplace Partners: Matillion, Yellowfin (Visualize)
Stages: Collect, Store, Process, Analyze
Data → Answers

Our Scenario and Source Files
File Types
-- Text - .csv
-- Compressed - .gz
File Categories
Details / Events
-- Flights
-- Weather
Metadata
-- Airports
-- Carriers
“In this scenario we will use Matillion ETL for Redshift to prepare two separate data sources for analysis.
The sample data is US airport flight information from 1995 to 2008. Every flight to or from a US airport (and whether it left on time or not) is included.
The second data set is weather data, taken from NOAA, including the daily weather readings for each US airport.”

Loading data from S3 into Redshift
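
As a rough illustration of the load step, the sketch below builds a Redshift COPY statement for gzip-compressed CSV files in S3. The table name, bucket path and IAM role ARN are hypothetical placeholders, and the assumption that each file has one header row is not stated in the slides. Matillion issues its own load commands; this only shows the shape of the underlying SQL.

```python
def build_copy_statement(table, s3_path, iam_role, gzip=True, ignore_header=1):
    """Return a Redshift COPY statement for CSV files stored in S3."""
    options = ["CSV"]
    if gzip:
        options.append("GZIP")  # source files are gzip-compressed (.gz)
    if ignore_header:
        options.append(f"IGNOREHEADER {ignore_header}")  # skip header rows
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"{' '.join(options)};"
    )


# Hypothetical names, for illustration only.
statement = build_copy_statement(
    table="flights",
    s3_path="s3://example-bucket/flights/",
    iam_role="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(statement)
```

The resulting string can be run from any SQL client connected to the cluster.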

Using Matillion ETL for Redshift
• Create an instance (AMI/EC2) of Matillion from AWS Marketplace
• Connect Matillion to Redshift

Table distribution styles
(Diagram: a two-node cluster. Node 1 holds Slices 1 and 2; Node 2 holds Slices 3 and 4.)
• Distribution Key: same key to same location (key1, key2, key3, key4)
• All: all data on every node
• Even: round robin distribution
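
The three styles can be compared with a small simulation. This is a sketch only: the two-node, four-slice layout follows the diagram, the sample rows are made up, and crc32 stands in for Redshift's internal hash function, which is not public. For the All style the sketch places each node's copy on that node's first slice.

```python
import zlib
from collections import defaultdict

NODES = {1: [1, 2], 2: [3, 4]}  # node -> slices, as in the diagram
SLICES = [s for slices in NODES.values() for s in slices]


def distribute(rows, style, key=None):
    """Return {slice: [rows]} for the given distribution style."""
    placement = defaultdict(list)
    for i, row in enumerate(rows):
        if style == "KEY":
            # Same key value always hashes to the same slice.
            # crc32 is a stand-in; Redshift's own hash is internal.
            h = zlib.crc32(str(row[key]).encode())
            placement[SLICES[h % len(SLICES)]].append(row)
        elif style == "ALL":
            # A full copy of the table on every node.
            for slices in NODES.values():
                placement[slices[0]].append(row)
        elif style == "EVEN":
            # Round robin across slices, regardless of row content.
            placement[SLICES[i % len(SLICES)]].append(row)
        else:
            raise ValueError(f"unknown style: {style}")
    return dict(placement)


# Made-up sample rows, for illustration only.
carriers = ["AA", "UA", "AA", "DL", "UA", "AA", "DL", "WN"]
rows = [{"carrier": c, "flight": n} for n, c in enumerate(carriers)]

by_key = distribute(rows, "KEY", key="carrier")
by_all = distribute(rows, "ALL")
by_even = distribute(rows, "EVEN")

for name, result in [("KEY", by_key), ("ALL", by_all), ("EVEN", by_even)]:
    print(name, {s: len(r) for s, r in sorted(result.items())})
```

Key distribution helps joins on the key column because matching rows share a slice, but a skewed key gives uneven slices. Even distribution balances rows without regard to content.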

Sort Keys
• Single Column - [ SORTKEY ( date ) ]
-- Queries that use the 1st column (i.e. date) as primary filter
• Compound - [ COMPOUND SORTKEY ( date, region, country ) ]
-- Queries that use the 1st column as primary filter, then other columns
• Interleaved - [ INTERLEAVED SORTKEY ( date, region, country ) ]
-- Queries that use different columns in filter
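
To show where these clauses sit in a table definition, here is a minimal sketch that builds CREATE TABLE statements for each sort key type. The table name, column names and column types are hypothetical; flight_date is used in place of date to avoid clashing with the type name.

```python
def create_table_ddl(table, columns, sort_columns, sort_style=None, dist_key=None):
    """Build a CREATE TABLE statement with an optional DISTKEY and a sort key."""
    if sort_style not in (None, "COMPOUND", "INTERLEAVED"):
        raise ValueError(f"unknown sort style: {sort_style}")
    cols = ",\n".join(f"    {name} {ctype}" for name, ctype in columns)
    ddl = f"CREATE TABLE {table} (\n{cols}\n)"
    if dist_key:
        ddl += f"\nDISTKEY ({dist_key})"
    # The style keyword goes before SORTKEY; omitting it gives the default.
    prefix = f"{sort_style} " if sort_style else ""
    ddl += f"\n{prefix}SORTKEY ({', '.join(sort_columns)});"
    return ddl


# Hypothetical schema, for illustration only.
columns = [
    ("flight_date", "DATE"),
    ("region", "VARCHAR(32)"),
    ("country", "VARCHAR(32)"),
    ("delay_minutes", "INTEGER"),
]

single = create_table_ddl("flights_single", columns, ["flight_date"])
compound = create_table_ddl(
    "flights_compound", columns, ["flight_date", "region", "country"],
    sort_style="COMPOUND", dist_key="region",
)
interleaved = create_table_ddl(
    "flights_interleaved", columns, ["flight_date", "region", "country"],
    sort_style="INTERLEAVED",
)
print(compound)
```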

Time Series Data – Vacuum Operation
• Table layout: Sorted Region followed by Unsorted Region
• Append in Sort Key Order
• Vacuum steps: Sort Unsorted Region, then Merge
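
The idea can be sketched as a toy model with two lists. It is a simplification, not Redshift's implementation: rows are plain ISO date strings, and a row joins the sorted region only when nothing is unsorted yet and the row is not lower than the last sorted row. Redshift applies further conditions before it appends to the sorted region.

```python
import heapq


def append_rows(sorted_region, unsorted_region, new_rows):
    """Append rows; they stay sorted only if they arrive in sort key order."""
    for row in new_rows:
        last = sorted_region[-1] if sorted_region else None
        if not unsorted_region and (last is None or row >= last):
            sorted_region.append(row)
        else:
            unsorted_region.append(row)


def vacuum(sorted_region, unsorted_region):
    """Sort the unsorted region, then merge it into the sorted region."""
    newly_sorted = sorted(unsorted_region)
    merged = list(heapq.merge(sorted_region, newly_sorted))
    return merged, []


sorted_region = ["2008-01-01", "2008-01-02"]
unsorted_region = []

# Time series data arriving in order: nothing lands in the unsorted region.
append_rows(sorted_region, unsorted_region, ["2008-01-03", "2008-01-04"])
in_order_unsorted = list(unsorted_region)

# A late-arriving row breaks the order, so it and later rows are unsorted.
append_rows(sorted_region, unsorted_region, ["2007-12-31", "2008-01-05"])
out_of_order_unsorted = list(unsorted_region)

sorted_region, unsorted_region = vacuum(sorted_region, unsorted_region)
print(sorted_region)
```

Loading time series data in sort key order keeps the unsorted region small, so there is less work left for a vacuum.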

Automate – https://github.com/lynnlangit/AWSDataWarehouse
