Amazon Redshift
Spend time with your data, not your database…
Data Warehouse Challenges
Cost
Complexity
Performance
Rigidity
[Chart, 1990–2020: Enterprise Data vs. Data in Warehouse]
Amazon Redshift powers Clickstream Analytics for
Amazon.com
• Web log analysis for Amazon.com
– Petabyte workload
– Largest table: 400 TB
• Understand customer behavior
– Who is browsing but not buying
– Which products/features are winners
– What sequence led to higher customer conversion
• Solution
– Best scale-out solution—query across 1 week
– Hadoop—query across 1 month
Amazon Redshift benefits realized
• Performance
– Scan 2.25 trillion rows of data: 14 minutes
– Load 5 billion rows of data: 10 minutes
– Backfill 150 billion rows of data: 9.75 hours
– Pig → Amazon Redshift: 2 days to 1 hr
• 10B row join with 700M rows
– Oracle → Amazon Redshift: 90 hours to 8 hrs
• Cost
– 1.6 PB cluster
– 100 8xl HDD nodes
– $180/hr
• Complexity
– 20% of one DBA’s time
• Backup
• Restore
• Resizing
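The figures on this slide can be turned into rates with a little arithmetic. The sketch below only divides the quoted numbers; the variable names are illustrative, and decimal units (1 PB = 1,000 TB) are an assumption. It works out to roughly 2.7 billion rows scanned per second, about $1.80 per node-hour, and 16 TB per node.

```python
# Figures quoted on the slide; nothing here is measured independently.
scan_rows, scan_minutes = 2.25e12, 14
load_rows, load_minutes = 5e9, 10
backfill_rows, backfill_hours = 150e9, 9.75
nodes, cluster_pb, cluster_cost_per_hour = 100, 1.6, 180.0


def rows_per_second(rows, seconds):
    """Average throughput implied by a row count and an elapsed time."""
    return rows / seconds


scan_rate = rows_per_second(scan_rows, scan_minutes * 60)
load_rate = rows_per_second(load_rows, load_minutes * 60)
backfill_rate = rows_per_second(backfill_rows, backfill_hours * 3600)

cost_per_node_hour = cluster_cost_per_hour / nodes
tb_per_node = cluster_pb * 1000 / nodes  # assumes 1 PB = 1,000 TB

print("scan:     {:,.0f} rows/s".format(scan_rate))
print("load:     {:,.0f} rows/s".format(load_rate))
print("backfill: {:,.0f} rows/s".format(backfill_rate))
print("cost:     ${:.2f} per node-hour".format(cost_per_node_hour))
print("storage:  {:.0f} TB per node".format(tb_per_node))
```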
Expanding Amazon Redshift Functionality
Scalar User-Defined Functions (UDF)
• Scalar UDFs using Python 2.7
– Return single result value for each input value
– Executed in parallel across cluster
– Syntax largely identical to PostgreSQL
– Function names beginning with f_ are reserved for customer UDFs
• Pandas, NumPy, SciPy pre-installed
– Do matrix operations, build optimization algorithms, and run statistical analyses
– Build end-to-end modeling workflow
• Import your own libraries
CREATE FUNCTION f_function_name
( [ argument_name arg_type, ... ] )
RETURNS data_type
{ VOLATILE | STABLE | IMMUTABLE }
AS $$
python_program
$$ LANGUAGE plpythonu;
Scalar UDF Security
• Run in restricted container that is fully isolated
– Cannot make system and network calls
– Cannot corrupt your cluster or negatively impact its performance
• Current limitations
– Can’t access file system - functions that write files won’t work
– Don’t yet cache stable and immutable functions
– Slower than built-in functions compiled to machine code
• Haven’t fully optimized some cases, including nested functions
Scalar UDF example - URL parsing
CREATE FUNCTION f_hostname (url varchar)
RETURNS varchar
IMMUTABLE AS $$
    import urlparse
    return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;

SELECT f_hostname(url) FROM table;

-- For comparison: the same extraction with a regular expression
SELECT REGEXP_REPLACE(url, '(https?)://([^@]*@)?([^:/]*)([/:].*|$)', '\3') FROM table;
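Because the UDF body is ordinary Python, its logic can be tried locally before the function is created in the cluster. The following is a minimal sketch under a few assumptions: it runs on Python 3, where the Python 2.7 `urlparse` module lives in `urllib.parse`; the helper names `hostname` and `hostname_regex` and the sample URL are made up for illustration. It compares the parser-based approach with the regular expression from the query.

```python
import re
from urllib.parse import urlparse  # Python 3 location of the 2.7 `urlparse` module


def hostname(url):
    """Same logic as the body of f_hostname."""
    return urlparse(url).hostname


# The pattern used in the REGEXP_REPLACE query; capture group 3 is the host.
HOST_RE = re.compile(r'(https?)://([^@]*@)?([^:/]*)([/:].*|$)')


def hostname_regex(url):
    """Replace the whole URL with capture group 3, as the SQL query does."""
    return HOST_RE.sub(r'\3', url)


sample = "https://user@example.com:8080/path?q=1"  # illustrative URL
print(hostname(sample))
print(hostname_regex(sample))
```

Both helpers return the same host for this sample. The parser-based version is shorter and easier to read, which is the point the slide makes by showing the two queries side by side.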
Scalar UDF example – Distance
CREATE FUNCTION f_distance (orig_lat float, orig_long float, dest_lat float, dest_long float)
RETURNS float
STABLE
AS $$
    import math
    # Haversine great-circle distance between two latitude/longitude points
    r = 3963.1676  # earth's radius, in miles
    phi_orig = math.radians(orig_lat)
    phi_dest = math.radians(dest_lat)
    delta_lat = math.radians(dest_lat - orig_lat)
    delta_long = math.radians(dest_long - orig_long)
    # Parentheses let the expression continue across two lines
    a = (math.sin(delta_lat/2) * math.sin(delta_lat/2) + math.cos(phi_orig)
         * math.cos(phi_dest) * math.sin(delta_long/2) * math.sin(delta_long/2))
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    d = r * c
    return d
$$ LANGUAGE plpythonu;
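The distance calculation can also be checked outside the database. This is a sketch of the same haversine arithmetic as a plain Python function; the name `distance_miles` and the test coordinates are assumptions chosen for illustration. Two easy sanity checks are that a point is zero miles from itself, and that two points on opposite sides of the equator are half a circumference (pi times the radius) apart.

```python
import math

EARTH_RADIUS_MILES = 3963.1676  # same constant as the UDF


def distance_miles(orig_lat, orig_long, dest_lat, dest_long):
    """Haversine great-circle distance in miles, mirroring f_distance."""
    phi_orig = math.radians(orig_lat)
    phi_dest = math.radians(dest_lat)
    delta_lat = math.radians(dest_lat - orig_lat)
    delta_long = math.radians(dest_long - orig_long)
    a = (math.sin(delta_lat / 2) ** 2
         + math.cos(phi_orig) * math.cos(phi_dest) * math.sin(delta_long / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_MILES * c


# Sanity checks with illustrative coordinates.
print(distance_miles(37.0, -122.0, 37.0, -122.0))  # same point
print(distance_miles(0.0, 0.0, 0.0, 180.0))        # opposite sides of the equator
```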
Redshift Github UDF Repository
Script                          Purpose
f_encryption.sql                Uses pyaes library to encrypt/decrypt strings using passphrase
f_next_business_day.sql         Uses pandas library to return dates which are US Federal Holiday aware
f_null_syns.sql                 Uses python sets to match strings, similar to a SQL IN condition
f_parse_url_query_string.sql    Uses urlparse to parse the field-value pairs from a url query string
f_parse_xml.sql                 Uses xml.etree.ElementTree to parse XML
f_unixts_to_timestamp.sql       Uses pandas library to convert a unix timestamp to UTC datetime
github.com/awslabs/amazon-redshift-udfs
Amazon Kinesis Firehose to Amazon Redshift
Load massive volumes of streaming data into Amazon Redshift
• Zero administration: Capture and deliver streaming data into Redshift without writing an application
• Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery
• Seamless elasticity: Scales to match data throughput without intervention
Capture and submit streaming data to Firehose
Firehose loads streaming data continuously into S3 and Redshift
Analyze streaming data using Chartio
• Uses your S3 bucket as an intermediate destination
• S3 bucket has ‘manifests’ folder – holds manifest of files to be copied
• Issues COPY command synchronously
• Single delivery stream loads into a single Redshift cluster, database, and table
• Continuously issues COPY once previous one is finished
• Frequency of COPYs determined by how fast your cluster can load files
• No partial loads. If a single record fails, whole file or batch fails
• Info on skipped files delivered to S3 bucket as manifest in errors folder
• If Firehose cannot reach the cluster, it retries every 5 min for 60 min and then moves on to the next batch of objects
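A manifest is a small JSON document listing the S3 objects that a COPY should load. The sketch below builds one in the format Redshift's COPY command accepts (an `entries` list of `url` and `mandatory` pairs). Whether Firehose writes its manifests in exactly this shape is an assumption, and the bucket name, object keys, and `build_manifest` helper are made up for illustration.

```python
import json


def build_manifest(s3_urls, mandatory=True):
    """Return a COPY-style manifest (JSON text) listing the given S3 objects."""
    entries = [{"url": url, "mandatory": mandatory} for url in s3_urls]
    return json.dumps({"entries": entries}, indent=2)


# Hypothetical objects in the intermediate S3 bucket.
batch = [
    "s3://example-bucket/stream/part-0001.gz",
    "s3://example-bucket/stream/part-0002.gz",
]
print(build_manifest(batch))
```

Marking every entry as mandatory matches the "no partial loads" behavior described above: a COPY that cannot find a mandatory file fails rather than loading the rest.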
Multi-Column Sort
• Compound sort keys
– Filter data by one leading column
• Interleaved sort keys
– Filter data by up to eight columns
– No storage overhead, unlike an index or projection
– Lower maintenance penalty
Compound sort keys illustrated
• Four records fill a block, sorted by customer
• Records with a given customer are all in one block.
• Records with a given product are spread across four blocks.
[Figure: table with columns cust_id, prod_id, other columns, and blocks, beside a 4×4 grid of [cust_id, prod_id] records with cust_id 1–4 as rows and prod_id 1–4 as columns]
Interleaved sort keys illustrated
• Records with a given customer are spread across two blocks.
• Records with a given product are also spread across two blocks.
• Both keys are equal.
[Figure: table with columns cust_id, prod_id, other columns, and blocks, beside a 4×4 grid of [cust_id, prod_id] records laid out by the interleaved sort key]
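The block counts on the two sort-key slides can be reproduced with a toy simulation. The sketch below lays out the 16 [cust_id, prod_id] records four to a block, first sorted by (cust_id, prod_id) and then by a bit-interleaved (Z-order) value. The Z-order model is an assumption used for illustration; the slides do not say how Redshift computes its interleaved ordering, and the function names are made up.

```python
from itertools import product

BLOCK_SIZE = 4  # four records fill a block, as on the slide


def blocks_touched(records, key_index, value, block_size=BLOCK_SIZE):
    """Count the blocks holding at least one record whose key equals value."""
    return len({i // block_size
                for i, rec in enumerate(records)
                if rec[key_index] == value})


def z_value(cust_id, prod_id):
    """Interleave the bits of the two 2-bit ids (assumed Z-order model)."""
    c, p = cust_id - 1, prod_id - 1
    return ((c >> 1) << 3) | ((p >> 1) << 2) | ((c & 1) << 1) | (p & 1)


records = list(product(range(1, 5), repeat=2))  # (cust_id, prod_id) pairs

compound = sorted(records)                                 # cust_id, then prod_id
interleaved = sorted(records, key=lambda r: z_value(*r))   # both keys weighted equally

for name, layout in (("compound", compound), ("interleaved", interleaved)):
    by_cust = blocks_touched(layout, 0, 1)  # filter on cust_id = 1
    by_prod = blocks_touched(layout, 1, 1)  # filter on prod_id = 1
    print(name, "blocks for one customer:", by_cust, "blocks for one product:", by_prod)
```

Under these assumptions the compound layout reads one block for a customer filter and four for a product filter, while the interleaved layout reads two blocks for either, matching the bullets on the slides.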
Interleaved Sort Key Considerations
• Vacuum time can increase by 10-50% for interleaved sort keys vs. compound keys
• If data increases monotonically, such as dates, interleaved sort order will skew over time
– You’ll need to run a vacuum operation to re-analyze the distribution and re-sort the data.
• Queries that filter on the leading sort column run faster with compound sort keys than with interleaved sort keys
SAN FRANCISCO
Questions/Comments?
Please contact us at redshift-feedback@amazon.com


Redshift Chartio Event Presentation


Editor's Notes

  • #8: Can’t add a file. When you run something like python script.py, the script is compiled to bytecode, and the CPython interpreter (itself a C program) reads the bytecode and executes it. The code is not translated to machine code. Other implementations, such as PyPy, use JIT compilation, i.e. they translate Python to machine code on the fly.
  • #9: urlparse is part of Python’s standard library.
  • #10: The distance calculation uses the haversine formula.
  • #12: The data producer sends data blobs as large as 1,000 KB to a delivery stream.
  • #16: With 1,000,000 blocks (1 TB per column) and an interleaved sort key on both customer ID and page ID, you scan 1,000 blocks when you filter on a specific customer or page, a speedup of 1000x compared to the unsorted case.
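The arithmetic behind that last note can be sketched as follows. With two interleaved columns, a filter on one of them narrows the scan to about the square root of the total block count. The generalization to k columns, N^((k-1)/k), is an assumption made here for illustration and is not stated in the notes, as is the function name.

```python
def blocks_scanned(total_blocks, interleaved_columns):
    """Estimated blocks read when filtering on one of k interleaved columns.

    Uses N ** ((k - 1) / k); for k = 2 this is the square root of N.
    """
    k = interleaved_columns
    return round(total_blocks ** ((k - 1) / k))


total = 1000000                      # blocks per column, from the note
scanned = blocks_scanned(total, 2)   # customer ID and page ID interleaved
speedup = total / scanned            # relative to scanning every block

print(scanned, speedup)
```

The same formula gives 2 blocks for the 4-block example on the interleaved slide, which agrees with its "spread across two blocks" bullets.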