Amazon Redshift
Fast, simple, petabyte-scale data warehousing for $1,000/TB/Year
150+ features
Pratim Das
Specialist Solutions Architect, Data & Analytics,
EMEA
Redshift Architecture
Managed Massively Parallel Petabyte Scale
Data Warehouse
Streaming Backup/Restore to S3
Load data from S3, DynamoDB and EMR
Extensive Security Features
Scale from 160 GB -> 2 PB Online
Fast
Compatible
Secure
Elastic
Simple
Cost Efficient
Amazon Redshift Cluster Architecture
Massively parallel, shared nothing
Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL processing
Compute nodes
• Local, columnar storage
• Executes queries in parallel
• Load, backup, restore
• 2, 16 or 32 slices
[Architecture diagram: SQL clients/BI tools connect via JDBC/ODBC to the leader node; the leader node coordinates the compute nodes (each with 128 GB RAM, 16 TB disk, 16 cores) over a 10 GigE (HPC) interconnect; ingestion, backup and restore run against S3, EMR, DynamoDB and SSH.]
Use Case: Traditional Data Warehousing
Business
Reporting
Advanced pipelines
and queries
Secure and
Compliant
Easy Migration – Point & Click using AWS Database Migration Service
Secure & Compliant – End-to-End Encryption. SOC 1/2/3, PCI-DSS, HIPAA and FedRAMP compliant
Large Ecosystem – Variety of cloud and on-premises BI and ETL tools
Japanese Mobile
Phone Provider
Powering 100 marketplaces
in 50 countries
World’s Largest Children’s
Book Publisher
Bulk Loads
and Updates
Use Case: Log Analysis
Log & Machine
IoT Data
Clickstream
Events Data
Time-Series
Data
Cheap – Analyze large volumes of data cost-effectively
Fast – Massively Parallel Processing (MPP) and columnar architecture for fast queries and parallel loads
Near real-time – Micro-batch loading and Amazon Kinesis Firehose for near-real time analytics
Interactive data analysis and
recommendation engine
Ride analytics for pricing
and product development
Ad prediction and
on-demand analytics
Use Case: Business Applications
Multi-Tenant BI
Applications
Back-end
services
Analytics as a
Service
Fully Managed – Provisioning, backups, upgrades, security, compression all come built-in so you can
focus on your business applications
Ease of Chargeback – Pay as you go, add clusters as needed. A few big common clusters, several
data marts
Service Oriented Architecture – Integrated with other AWS services. Easy to plug into your pipeline
Infosys Information
Platform (IIP)
Analytics-as-a-Service
Product and Consumer
Analytics
Effective use of MPP
architecture
Design for Queryability
• Equally on each slice
• Minimum amount of work
• Use just enough cluster resources
Do an Equal Amount of Work
on Each Slice
Choose Best Table Distribution Style
All: all data on every node
Key: same key to same location
Even: round-robin distribution across slices
[Diagram: each style illustrated across two nodes with two slices per node]
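The three styles above are chosen per table with the DISTSTYLE clause. A minimal sketch (table and column names are hypothetical):

```sql
-- Small dimension table: copied to every node so joins never move it
CREATE TABLE dim_calendar (dateid int, caldate date) DISTSTYLE ALL;

-- Large fact table: co-locate rows that join on the same key
CREATE TABLE sales (dateid int, listid int, qtysold int)
DISTSTYLE KEY DISTKEY (listid);

-- No dominant join key: spread rows round-robin across slices
CREATE TABLE staging_events (raw varchar(max)) DISTSTYLE EVEN;
```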
Do the Minimum Amount of
Work on Each Slice
Columnar storage
+
Large data block sizes
+
Data compression
+
Zone maps
+
Direct-attached storage
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
[Zone map illustration: each block stores the min and max of its values (e.g. blocks spanning 10–324, 375–623 and 637–959), so blocks whose range cannot match a predicate are skipped.]
Reduced I/O = Enhanced Performance
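Zone maps pay off most when data is sorted, since sorting keeps each block's min/max range narrow. A sketch using the `listing` columns shown above (encodings taken from the `analyze compression` output; the query is hypothetical):

```sql
-- Sorting by listtime keeps each 1 MB block's zone map narrow
CREATE TABLE listing (
  listid   int       ENCODE delta,
  sellerid int       ENCODE delta32k,
  listtime timestamp ENCODE raw
)
SORTKEY (listtime);

-- Only blocks whose min/max range overlaps January are read
SELECT COUNT(*)
FROM listing
WHERE listtime BETWEEN '2017-01-01' AND '2017-01-31';
```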
Use Cluster Resources
Efficiently to Complete as
Quickly as Possible
Amazon Redshift Workload Management
[Diagram: BI tools, SQL clients and analytics tools submit queries; WLM holds waiting queries and runs admitted queries in queue slots.]
Example: a Queries queue with 80% memory and 4 slots (80/4 = 20% per slot), and an ETL queue with 20% memory and 2 slots (20/2 = 10% per slot).
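A session can temporarily claim more than one slot in its queue so a heavy statement gets more memory. A sketch, assuming a query group named 'etl' is routed to the ETL queue:

```sql
-- Route this session to the ETL queue
SET query_group TO 'etl';

-- Claim both ETL slots: this statement now gets the queue's
-- full 20% of memory instead of 10%
SET wlm_query_slot_count TO 2;
VACUUM sales;
SET wlm_query_slot_count TO 1;
```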
Query monitoring rules
Common use cases:
• Protect interactive queues
INTERACTIVE = { "query_execution_time > 15 sec" or
"query_cpu_time > 1500 uSec" or
"query_blocks_read > 18000 blocks" } [HOP]
• Monitor ad-hoc queues for heavy queries
AD-HOC = { "query_execution_time > 120" or
"query_cpu_time > 3000" or
"query_blocks_read > 180000" or
"memory_to_disk > 400000000000" } [LOG]
• Limit the number of rows returned to a client
MAXLINES = { "RETURN_ROWS > 50000" } [ABORT]
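Rule actions (log, hop, abort) are recorded in a system table, so you can audit what fired:

```sql
-- Which queries triggered a monitoring rule, and what action was taken
SELECT query, rule, action, recordtime
FROM stl_wlm_rule_action
ORDER BY recordtime DESC
LIMIT 20;
```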
Redshift Performance Tuning
Redshift Playbook
Part 1: Preamble, Prerequisites, and
Prioritization
Part 2: Distribution Styles and
Distribution Keys
Part 3: Compound and Interleaved
Sort Keys
Part 4: Compression Encodings
Part 5: Table Data Durability
amzn.to/2quChdM
Optimizing Amazon Redshift by Using the AWS
Schema Conversion Tool
amzn.to/2sTYow1
Ingestion, ETL & BI
Getting data to Redshift using AWS DMS
Simple to use · Minimal downtime · Supports most widely used databases
Low cost · Fast & easy to set up · Reliable
Loading data from S3
• Splitting Your Data into Multiple Files
• Uploading Files to Amazon S3
• Using the COPY Command to Load from
Amazon S3
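The three steps above come together in a single COPY that loads all matching file parts in parallel (bucket, prefix and IAM role below are hypothetical):

```sql
-- Files split as s3://mybucket/data/listing_part_00, _01, ...
-- load in parallel, spread across the cluster's slices
COPY listing
FROM 's3://mybucket/data/listing_part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
DELIMITER '|' GZIP
REGION 'eu-west-1';
```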
ETL on Redshift
QuickSight for BI on Redshift
Amazon Redshift
Querying Redshift with R Packages
• RJDBC – supports SQL queries
• dplyr – uses R code for data analysis
• RPostgreSQL – an R driver compliant with the Database Interface (DBI)
[Diagram: an R user in RStudio on Amazon EC2 queries unstructured data in Amazon S3, user profiles in Amazon RDS, and Amazon Redshift.]
Connecting R with Amazon Redshift blog post: https://blogs.aws.amazon.com/bigdata/post/Tx1G8828SPGX3PK/Connecting-R-with-Amazon-Redshift
Statistical UDF Example
CREATE FUNCTION f_z_test_by_pval (alpha float,
x_bar float, test_val float, sigma float, n float)
RETURNS varchar
STABLE AS $$
    import math
    import scipy.stats as st
    z = (x_bar - test_val) / (sigma / math.sqrt(n))
    p = st.norm.cdf(z)
    if p <= alpha:
        return 'Statistically significant'
    else:
        return 'May have occurred by random chance'
$$ LANGUAGE plpythonu;
Introduction to Python UDFs in Amazon Redshift: https://blogs.aws.amazon.com/bigdata/post/Tx1IHV1G67CY53T/Introduction-to-Python-UDFs-in-Amazon-Redshift
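Once created, the UDF is called like any scalar SQL function. A hypothetical invocation, testing a sample mean of 105 against a hypothesized mean of 100 (alpha = 0.05, sigma = 15, n = 30):

```sql
-- Arguments: alpha, sample mean, hypothesized mean, population sigma, n
SELECT f_z_test_by_pval(0.05, 105.0, 100.0, 15.0, 30.0);
```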
Application Developers can build smart
applications using Amazon Machine
Learning
Structured Data/Predictions
Amazon Redshift
Generate/Query
Predictions
Amazon QuickSight
Application
Amazon Machine
Learning
Visualize
• All skill levels
• Machine Learning technology is accessed through APIs / SDKs
• Embed visualizations in applications
Structured data
In Amazon Redshift
Load predictions into
Amazon Redshift
-or-
Read prediction results
directly from Amazon S3
Predictions
in Amazon S3
Query for predictions with
Amazon ML batch API
Your application
Batch predictions with Amazon Redshift
New Features Since Last
Meetup
https://aws.amazon.com/redshift/whats-new/
AWS Schema Conversion Tool Exports from
SQL Server, Oracle and Teradata to Amazon
Redshift
AWS Schema Conversion Tool (SCT) can now extract data
from a Microsoft SQL Server data warehouse for direct
import into Amazon Redshift. This follows the recently
announced capability to convert SQL Server data
warehouse schemas.
AWS Schema Conversion Tool (SCT) can now also extract
data from Teradata and Oracle data warehouses for direct
import into Amazon Redshift.
Encrypting unloaded data using Amazon S3
server-side encryption with AWS KMS keys
The Amazon Redshift UNLOAD command now supports Amazon S3
server-side encryption using an AWS KMS key. The UNLOAD
command unloads the results of a query to one or more files on
Amazon S3. You can let Amazon Redshift automatically encrypt your
data files using Amazon S3 server-side encryption, or you can specify
a symmetric encryption key that you manage. With this release, you
can use Amazon S3 server-side encryption with a key managed by
AWS KMS. In addition, the COPY command loads Amazon S3 server-side encrypted data files without requiring you to provide the key.
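An UNLOAD using SSE-KMS might look like this (the role ARN, key ID, bucket and table are hypothetical):

```sql
UNLOAD ('SELECT * FROM sales WHERE dateid >= 2000')
TO 's3://mybucket/unload/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
KMS_KEY_ID '1234abcd-12ab-34cd-56ef-1234567890ab'
ENCRYPTED;
```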
Zstandard for high data compression encoding
Amazon Redshift now supports Zstandard (ZSTD) column
compression encoding, which delivers better data
compression thereby reducing the amount of storage and
I/O needed. With the addition of ZSTD, Amazon Redshift
now offers seven compression encodings to choose from
depending on your dataset.
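ZSTD is applied per column, either declared at table creation or suggested by ANALYZE COMPRESSION. A sketch with hypothetical columns:

```sql
-- ZSTD compresses well on varchar columns with repetitive content
CREATE TABLE events (
  eventid   int          ENCODE delta,
  eventname varchar(200) ENCODE zstd,
  detail    varchar(max) ENCODE zstd
);
```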
Query monitoring rules (QMR)
You can use the query monitoring rules feature to set metrics-based
performance boundaries for workload management (WLM) queues,
and specify what action to take when a query goes beyond those
boundaries.
For example, for a queue that’s dedicated to short running queries, you
might create a rule that aborts queries that run for more than 60
seconds. To track poorly designed queries, you might have another rule
that logs queries that contain nested loops. We also provide predefined
rule templates in the Amazon Redshift management console to get you started.
New Functions
STV_QUERY_METRICS displays the metrics for currently running
queries and STL_QUERY_METRICS records the metrics for
completed queries.
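A sketch of inspecting query-level metrics for completed queries (check the column list against the system-table docs for your cluster version):

```sql
-- Rows with segment = -1 hold metrics rolled up for the whole query
SELECT query, rows, cpu_time, run_time
FROM stl_query_metrics
WHERE segment = -1
ORDER BY run_time DESC
LIMIT 20;
```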
The new APPROXIMATE PERCENTILE_DISC function returns the
value in a list that's closest to a given percentile. Approximation
enables the function to execute much faster, with a relative error of
around 0.5 percent.
Previously available as window functions, PERCENTILE_CONT and
MEDIAN are now also available as aggregate functions.
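Usage sketch contrasting the exact and approximate forms as aggregates (table and column are from the earlier `listing` example):

```sql
-- Exact median, computed as an aggregate
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY priceperticket)
FROM listing;

-- Approximate: much faster on large tables, ~0.5% relative error
SELECT APPROXIMATE PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY priceperticket)
FROM listing;
```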
Support for Python UDF logging module
You can now use the standard Python logging module to
log error and warning messages from Amazon Redshift
user-defined functions (UDFs). You can then query the
SVL_UDF_LOG system view to retrieve the messages
logged from your UDFs and troubleshoot them easily.
For more information and examples, see Logging Errors
and Warnings in UDFs in the Amazon Redshift Database
Developer Guide.
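A sketch of a UDF that logs a warning, and the follow-up query against SVL_UDF_LOG (function and column names here are hypothetical):

```sql
CREATE FUNCTION f_parse_qty (s varchar)
RETURNS int
STABLE AS $$
    import logging
    try:
        return int(s)
    except (TypeError, ValueError):
        logging.warning('could not parse quantity: %s' % s)
        return None
$$ LANGUAGE plpythonu;

-- Retrieve the logged messages afterwards
SELECT funcname, message, created
FROM svl_udf_log
ORDER BY created DESC
LIMIT 20;
```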
Record and govern Amazon Redshift
configurations with AWS Config
You can now record configuration changes to your Amazon Redshift
clusters with AWS Config. The detailed configuration recorded by AWS
Config includes changes made to Amazon Redshift clusters, cluster
parameter groups, cluster security groups, cluster snapshots, cluster
subnet groups, and event subscriptions. In addition, you can run two
new managed Config Rules to check whether your Amazon Redshift
clusters have the appropriate configuration and maintenance settings.
These checks include verifying that your cluster database is encrypted,
logging is enabled, snapshot data retention period is set appropriately,
and much more.
Kinesis Firehose can now prepare and
transform streaming data before loading it to
data stores
You can now configure Amazon Kinesis Firehose to prepare your
streaming data before it is loaded to data stores. With this new feature,
you can easily convert raw streaming data from your data sources into
formats required by your destination data stores, without having to
build your own data processing pipelines.
To use this feature, simply select an AWS Lambda function from the
Amazon Kinesis Firehose delivery stream configuration tab in the AWS
Management console. Amazon Kinesis Firehose will automatically
apply that function to every input data record and load the transformed
data to destinations.
Open Source: Amazon Kinesis Data Generator
(KDG)
The Amazon Kinesis Data Generator
is a UI that simplifies how you send
test data to Amazon Kinesis Streams
or Amazon Kinesis Firehose. Using
the Amazon Kinesis Data Generator,
you can create templates for your
data, create random values to use for
your data, and save the templates for
future use. bit.ly/2tgQVJc
Amazon Redshift Spectrum
Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
Fast @ exabyte scale · Elastic & highly available · On-demand, pay-per-query
High concurrency: Multiple
clusters access same data
No ETL: Query data in-place
using open file formats
Full Amazon Redshift
SQL support
Example SQL query against data in S3:
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY …
Life of a query
[Diagram: a SQL client connects via JDBC/ODBC to the Amazon Redshift cluster, which fans out to Amazon Redshift Spectrum nodes 1…N; data lives in Amazon S3 (exabyte-scale object storage), with partition metadata in the Data Catalog / Apache Hive Metastore.]
1. Query is issued from the SQL client over JDBC/ODBC
2. Query is optimized and compiled at the leader node, which determines what gets run locally and what goes to Amazon Redshift Spectrum
3. Query plan is sent to all compute nodes
4. Compute nodes obtain partition info from the Data Catalog and dynamically prune partitions
5. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer
6. Amazon Redshift Spectrum nodes scan your S3 data
7. Amazon Redshift Spectrum projects, filters, joins and aggregates
8. Final aggregations and joins with local Amazon Redshift tables are done in-cluster
9. Result is sent back to the client
Amazon Redshift Spectrum – Current support
File formats
• Parquet
• CSV
• Sequence
• RCFile
• ORC (coming soon)
• RegExSerDe (coming soon)
Compression
• Gzip
• Snappy
• Lzo (coming soon)
• Bz2
Encryption
• SSE with AES256
• SSE KMS with default
key
Column types
• Numeric: bigint, int, smallint, float, double
and decimal
• Char/varchar/string
• Timestamp
• Boolean
• DATE type can be used only as a
partitioning key
Table type
• Non-partitioned table
(s3://mybucket/orders/..)
• Partitioned table
(s3://mybucket/orders/date=YYYY-MM-DD/..)
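Spectrum tables live in an external schema backed by the Data Catalog. A sketch of both table types above (database name, role ARN and bucket are hypothetical):

```sql
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Partitioned external table; note DATE is used only as the partition key
CREATE EXTERNAL TABLE spectrum.orders (
  orderid bigint,
  total   decimal(12,2)
)
PARTITIONED BY (orderdate date)
STORED AS PARQUET
LOCATION 's3://mybucket/orders/';
```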
Is Amazon Redshift Spectrum useful if I don’t have an exabyte?
Your data will get bigger
On average, data warehousing volumes grow 10x every 5 years
The average Amazon Redshift customer doubles data each year
Amazon Redshift Spectrum makes data analysis simpler
Access your data without ETL pipelines
Teams using Amazon EMR, Athena & Redshift can collaborate using the same data lake
Amazon Redshift Spectrum improves availability and concurrency
Run multiple Amazon Redshift clusters against common data
Isolate jobs with tight SLAs from ad hoc analysis
Building a Data Strategy on AWS
[Architecture diagram: a numbered pipeline (steps 1–10) combining Kinesis Firehose for ingestion, the Athena query service, Glue, and batch processing.]
Redshift Partner Ecosystem
4 types of partners
• Load and transform your data with Data Integration
Partners
• Analyze data and share insights across your
organization with Business Intelligence Partners
• Architect and implement your analytics platform
with System Integration and Consulting Partners
• Query, explore and model your data using tools and
utilities from Query and Data Modeling Partners
aws.amazon.com/redshift/partners/
“Some” Amazon Redshift Customers
1. Analyze Database Audit Logs for Security and
Compliance Using Amazon Redshift Spectrum
2. Build a Healthcare Data Warehouse Using Amazon EMR,
Amazon Redshift, AWS Lambda, and OMOP
3. Run Mixed Workloads with Amazon Redshift Workload
Management
4. Converging Data Silos to Amazon Redshift Using AWS
DMS
5. Powering Amazon Redshift Analytics with Apache Spark
and Amazon Machine Learning
6. Using pgpool and Amazon ElastiCache for Query Caching
with Amazon Redshift
7. Extending Seven Bridges Genomics with Amazon Redshift
and R
8. Zero Admin Lambda based Redshift Loader
Architecture · Tuning · Integration · Spectrum · Ecosystem
OLX Summary
Fast
Compatible
Secure
Elastic
Simple
Cost Efficient
amzn.to/2tlylga
amzn.to/2srIL1g
amzn.to/2rgR8Z7
amzn.to/2lr66MH
amzn.to/2kIr1bq
amzn.to/2rr7LWq
amzn.to/2szR3nf
bit.ly/2swvvI6
We are hiring!!!
www.amazon.jobs
Thank You
Data is magic!
London Redshift Meetup - July 2017

  • 1. Amazon Redshift shift Fast, simple, petabyte-scale data warehousing for $1,000/TB/Year 150+ features Pratim Das Specialist Solutions Architect, Data & Analytics, EMEA
  • 3. Managed Massively Parallel Petabyte Scale Data Warehouse Streaming Backup/Restore to S3 Load data from S3, DynamoDB and EMR Extensive Security Features Scale from 160 GB -> 2 PB Online Fast CompatibleSecure ElasticSimple Cost Efficient
  • 4. Amazon Redshift Cluster Architecture Massively parallel, shared nothing Leader node • SQL endpoint • Stores metadata • Coordinates parallel SQL processing Compute nodes • Local, columnar storage • Executes queries in parallel • Load, backup, restore • 2, 16 or 32 slices 10 GigE (HPC) Ingestion Backup Restore SQL Clients/BI Tools 128GB RAM 16TB disk 16 cores S3 / EMR / DynamoDB / SSH JDBC/ODBC 128GB RAM 16TB disk 16 coresCompute Node 128GB RAM 16TB disk 16 coresCompute Node 128GB RAM 16TB disk 16 coresCompute Node Leader Node
  • 5. Use Case: Traditional Data Warehousing Business Reporting Advanced pipelines and queries Secure and Compliant Easy Migration – Point & Click using AWS Database Migration Service Secure & Compliant – End-to-End Encryption. SOC 1/2/3, PCI-DSS, HIPAA and FedRAMP compliant Large Ecosystem – Variety of cloud and on-premises BI and ETL tools Japanese Mobile Phone Provider Powering 100 marketplaces in 50 countries World’s Largest Children’s Book Publisher Bulk Loads and Updates
  • 6. Use Case: Log Analysis Log & Machine IOT Data Clickstream Events Data Time-Series Data Cheap – Analyze large volumes of data cost-effectively Fast – Massively Parallel Processing (MPP) and columnar architecture for fast queries and parallel loads Near real-time – Micro-batch loading and Amazon Kinesis Firehose for near-real time analytics Interactive data analysis and recommendation engine Ride analytics for pricing and product development Ad prediction and on-demand analytics
  • 7. Use Case: Business Applications Multi-Tenant BI Applications Back-end services Analytics as a Service Fully Managed – Provisioning, backups, upgrades, security, compression all come built-in so you can focus on your business applications Ease of Chargeback – Pay as you go, add clusters as needed. A few big common clusters, several data marts Service Oriented Architecture – Integrated with other AWS services. Easy to plug into your pipeline Infosys Information Platform (IIP) Analytics-as-a- Service Product and Consumer Analytics
  • 8. Effective use of MPP architecture
  • 9. Design for Queryability • Equally on each slice • Minimum amount of work • Use just enough cluster resources
  • 10. Do an Equal Amount of Work on Each Slice
  • 11. Choose Best Table Distribution Style All Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 All data on every node Key Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 Same key to same location Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 Even Round robin distribution
  • 12. Do the Minimum Amount of Work on Each Slice
  • 13. Columnar storage + Large data block sizes + Data compression + Zone maps + Direct-attached storage analyze compression listing; Table | Column | Encoding ---------+----------------+---------- listing | listid | delta listing | sellerid | delta32k listing | eventid | delta32k listing | dateid | bytedict listing | numtickets | bytedict listing | priceperticket | delta32k listing | totalprice | mostly32 listing | listtime | raw 10 | 13 | 14 | 26 |… … | 100 | 245 | 324 375 | 393 | 417… … 512 | 549 | 623 637 | 712 | 809 … … | 834 | 921 | 959 10 324 375 623 637 959 Reduced I/O = Enhanced Performance
  • 14. Use Cluster Resources Efficiently to Complete as Quickly as Possible
  • 15. Amazon Redshift Workload Management Waiting Workload Management BI tools SQL clients Analytics tools Client Running Queries: 80% memory ETL: 20% memory 4 Slots 2 Slots 80/4 = 20% per slot 20/2 = 10% per slot
  • 16. Query monitoring rules Common use cases: • Protect interactive queues INTERACTIVE = { “query_execution_time > 15 sec” or “query_cpu_time > 1500 uSec” or ”query_blocks_read > 18000 blocks” } [HOP] • Monitor ad-hoc queues for heavy queries AD-HOC = { “query_execution_time > 120” or “query_cpu_time > 3000” or ”query_blocks_read > 180000” or “memory_to_disk > 400000000000”} [LOG] • Limit the number of rows returned to a client MAXLINES = { “RETURN_ROWS > 50000” } [ABORT]
  • 18. Redshift Playbook Part 1: Preamble, Prerequisites, and Prioritization Part 2: Distribution Styles and Distribution Keys Part 3: Compound and Interleaved Sort Keys Part 4: Compression Encodings Part 5: Table Data Durability amzn.to/2quChdM
  • 19. Optimizing Amazon Redshift by Using the AWS Schema Conversion Tool amzn.to/2sTYow1
  • 21. Getting data to Redshift using AWS DMS Simple to use Minimal Downtime Supports most widely used Databases Low Cost Fast & Easy to Set-up Reliable
  • 22. Loading data from S3 • Splitting Your Data into Multiple Files • Uploading Files to Amazon S3 • Using the COPY Command to Load from Amazon S3
  • 24. QuickSight for BI on Redshift Amazon Redshift
  • 25. Querying Redshift with R Packages • RJDBC – supports SQL queries • dplyr – Uses R code for data analysis • RPostgreSQL - R compliant driver or Database Interface (DBI) R User R Studio Amazon EC2 Unstructured Data Amazon S3 User Profile Amazon RDS Amazon Redshift Connecting R with Amazon Redshift blog post: https://blogs.aws.amazon.com/bigdata/post/Tx1G8828SPGX3PK/Connecting-R-with-Amazon-Redshift
  • 26. Statistical UDF Example CREATE FUNCTION f_z_test_by_pval (alpha float, x_bar float, test_val float, sigma float, n float) RETURNS varchar STABLE AS $$ import scipy.stats as st import math as math z = (x_bar - test_val) / (sigma / math.sqrt(n)) p = st.norm.cdf(z) if p <= alpha: return 'Statistically significant' else: return 'May have occurred by random chance' $$ LANGUAGE plpythonu; Introduction to Python UDFs in Amazon Redshift: https://blogs.aws.amazon.com/bigdata/post/Tx1IHV1G67CY53T/Introduction-to-Python- UDFs-in-Amazon-Redshift
  • 27. Application Developers can build smart applications using Amazon Machine Learning Structured Data/Predictions Amazon Redshift Generate/Query Predictions Amazon QuickSight Application Amazon Machine Learning Visualize • All skill levels • Machine Learning technology is accessed through APIs / SDKs • Embed visualizations in applications
  • 28. Structured data In Amazon Redshift Load predictions into Amazon Redshift -or- Read prediction results directly from Amazon S3 Predictions in Amazon S3 Query for predictions with Amazon ML batch API Your application Batch predictions with Amazon Redshift
  • 29. New Features Since Last Meetup https://aws.amazon.com/redshift/whats-new/
  • 30. AWS Schema Conversion Tool Exports from SQL Server, Oracle and Teradata to Amazon Redshift AWS Schema Conversion Tool (SCT) can now extract data from a Microsoft SQL Server data warehouse for direct import into Amazon Redshift. This follows the recently announced capability to convert SQL Server data warehouse schemas. AWS Schema Conversion Tool (SCT) can now also extract data from Teradata and Oracle data warehouses for direct import into Amazon Redshift.
  • 31. Encrypting unloaded data using Amazon S3 server-side encryption with AWS KMS keys The Amazon Redshift UNLOAD command now supports Amazon S3 server-side encryption using an AWS KMS key. The UNLOAD command unloads the results of a query to one or more files on Amazon S3. You can let Amazon Redshift automatically encrypt your data files using Amazon S3 server-side encryption, or you can specify a symmetric encryption key that you manage. With this release, you can use Amazon S3 server-side encryption with a key managed by AWS KMS. In addition, the COPY command loads Amazon S3 server- side encrypted data files without requiring you to provide the key.
  • 32. Zstandard for high data compression encoding Amazon Redshift now supports Zstandard (ZSTD) column compression encoding, which delivers better data compression thereby reducing the amount of storage and I/O needed. With the addition of ZSTD, Amazon Redshift now offers seven compression encodings to choose from depending on your dataset.
  • 33. Query monitoring rules (QMR). You can use the query monitoring rules feature to set metrics-based performance boundaries for workload management (WLM) queues, and specify what action to take when a query goes beyond those boundaries. For example, for a queue that’s dedicated to short-running queries, you might create a rule that aborts queries that run for more than 60 seconds. To track poorly designed queries, you might have another rule that logs queries that contain nested loops. We also provide predefined rule templates in the Amazon Redshift management console to get you started.
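The 60-second abort rule and the nested-loop logging rule described above can be sketched as WLM configuration JSON. This is a sketch only: field and metric names should be checked against the WLM JSON configuration reference, and the values are illustrative:

```python
import json

# Sketch of a WLM queue definition with two query monitoring rules, mirroring
# the examples on the slide: abort queries running over 60 s, and log queries
# that contain nested loops. Field/metric names follow the WLM JSON format;
# concurrency and thresholds here are illustrative.
wlm_queue = {
    "query_concurrency": 5,
    "rules": [
        {
            "rule_name": "abort_long_running",
            "predicate": [
                {"metric_name": "query_execution_time", "operator": ">", "value": 60}
            ],
            "action": "abort",
        },
        {
            "rule_name": "log_nested_loops",
            "predicate": [
                {"metric_name": "nested_loop_join_row_count", "operator": ">", "value": 0}
            ],
            "action": "log",
        },
    ],
}

print(json.dumps(wlm_queue, indent=2))
```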
  • 34. New Functions STV_QUERY_METRICS displays the metrics for currently running queries and STL_QUERY_METRICS records the metrics for completed queries. The new APPROXIMATE PERCENTILE_DISC function returns the value in a list that's closest to a given percentile. Approximation enables the function to execute much faster, with a relative error of around 0.5 percent. Previously available as window functions, PERCENTILE_CONT and MEDIAN are now also available as aggregate functions.
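The discrete-percentile semantics (returning an actual value from the list rather than an interpolated one) can be illustrated in plain Python. A sketch of the standard PERCENTILE_DISC definition, not Redshift's implementation:

```python
import math

def percentile_disc(values, p):
    # Return an element of the list: the first value in sorted order whose
    # cumulative fraction of rows is >= p (standard PERCENTILE_DISC semantics).
    ordered = sorted(values)
    idx = math.ceil(p * len(ordered)) - 1
    return ordered[max(idx, 0)]

data = [15, 20, 35, 40, 50]
print(percentile_disc(data, 0.5))   # the median element of the list
```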
  • 35. Support for the Python UDF logging module. You can now use the standard Python logging module to log error and warning messages from Amazon Redshift user-defined functions (UDFs). You can then query the SVL_UDF_LOG system view to retrieve the messages logged from your UDFs and troubleshoot them easily. For more information and examples, see Logging Errors and Warnings in UDFs in the Amazon Redshift Database Developer Guide.
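A UDF body using the logging module might look like the following sketch; inside Redshift the messages would surface in SVL_UDF_LOG, while run locally they go to the standard Python logger (the function name and logic are illustrative):

```python
import logging

# Illustrative UDF-style function that logs a warning instead of failing.
logger = logging.getLogger("udf_example")

def f_safe_divide(a, b):
    if b == 0:
        logger.warning("division by zero; returning None")
        return None
    return a / b

print(f_safe_divide(10.0, 4.0))
print(f_safe_divide(1.0, 0.0))   # emits a warning, returns None
```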
  • 36. Record and govern Amazon Redshift configurations with AWS Config You can now record configuration changes to your Amazon Redshift clusters with AWS Config. The detailed configuration recorded by AWS Config includes changes made to Amazon Redshift clusters, cluster parameter groups, cluster security groups, cluster snapshots, cluster subnet groups, and event subscriptions. In addition, you can run two new managed Config Rules to check whether your Amazon Redshift clusters have the appropriate configuration and maintenance settings. These checks include verifying that your cluster database is encrypted, logging is enabled, snapshot data retention period is set appropriately, and much more.
  • 37. Kinesis Firehose can now prepare and transform streaming data before loading it to data stores You can now configure Amazon Kinesis Firehose to prepare your streaming data before it is loaded to data stores. With this new feature, you can easily convert raw streaming data from your data sources into formats required by your destination data stores, without having to build your own data processing pipelines. To use this feature, simply select an AWS Lambda function from the Amazon Kinesis Firehose delivery stream configuration tab in the AWS Management console. Amazon Kinesis Firehose will automatically apply that function to every input data record and load the transformed data to destinations.
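The transformation function follows the Firehose record-transformation contract: each input record carries base64-encoded data, and the function returns each record with a result status and (possibly transformed) data. A minimal sketch, with an illustrative transformation and a fake local invocation:

```python
import base64
import json

def lambda_handler(event, context):
    # Firehose passes a batch of records; each must come back with the same
    # recordId, a result status ("Ok" here), and base64-encoded data.
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["source"] = "firehose"          # illustrative transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}

# Local invocation with a fake event
fake = {"records": [{"recordId": "1",
                     "data": base64.b64encode(b'{"user": "a"}').decode()}]}
print(lambda_handler(fake, None))
```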
  • 38. Open Source: Amazon Kinesis Data Generator (KDG) The Amazon Kinesis Data Generator is a UI that simplifies how you send test data to Amazon Kinesis Streams or Amazon Kinesis Firehose. Using the Amazon Kinesis Data Generator, you can create templates for your data, create random values to use for your data, and save the templates for future use. bit.ly/2tgQVJc
  • 40. Amazon Redshift Spectrum: run SQL queries directly against data in S3 using thousands of nodes. Fast at exabyte scale. Elastic and highly available. On-demand, pay-per-query. High concurrency: multiple clusters access the same data. No ETL: query data in place using open file formats. Full Amazon Redshift SQL support.
  • 41–49. Life of a query (Amazon Redshift cluster with compute nodes 1…N; Amazon S3 exabyte-scale object storage; Data Catalog / Apache Hive Metastore):
1. A query arrives over JDBC/ODBC, e.g. SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY…
2. The query is optimized and compiled at the leader node, which determines what runs locally and what goes to Amazon Redshift Spectrum.
3. The query plan is sent to all compute nodes.
4. Compute nodes obtain partition info from the Data Catalog and dynamically prune partitions.
5. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer.
6. Amazon Redshift Spectrum nodes scan your S3 data.
7. Amazon Redshift Spectrum projects, filters, joins and aggregates.
8. Final aggregations and joins with local Amazon Redshift tables are done in-cluster.
9. The result is sent back to the client.
  • 50. Amazon Redshift Spectrum – current support.
File formats: Parquet, CSV, Sequence, RCFile; ORC and RegExSerDe coming soon.
Compression: Gzip, Snappy, Bz2; Lzo coming soon.
Encryption: SSE with AES256; SSE-KMS with default key.
Column types: numeric (bigint, int, smallint, float, double and decimal), char/varchar/string, timestamp, boolean; the DATE type can be used only as a partitioning key.
Table types: non-partitioned tables (s3://mybucket/orders/..) and partitioned tables (s3://mybucket/orders/date=YYYY-MM-DD/..).
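The partitioned table layout above maps partition values to S3 prefixes. A small sketch generating such date prefixes (bucket and table names are illustrative):

```python
from datetime import date, timedelta

def partition_prefixes(bucket, table, start, days):
    # Build date-partitioned S3 prefixes of the form shown on the slide:
    # s3://mybucket/orders/date=YYYY-MM-DD/
    return ["s3://%s/%s/date=%s/" % (bucket, table,
                                     (start + timedelta(days=i)).isoformat())
            for i in range(days)]

print(partition_prefixes("mybucket", "orders", date(2017, 6, 1), 3))
```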
  • 51. Is Amazon Redshift Spectrum useful if I don’t have an exabyte?
Your data will get bigger: on average, data warehousing volumes grow 10x every 5 years, and the average Amazon Redshift customer doubles data each year.
Amazon Redshift Spectrum makes data analysis simpler: access your data without ETL pipelines; teams using Amazon EMR, Athena and Redshift can collaborate using the same data lake.
Amazon Redshift Spectrum improves availability and concurrency: run multiple Amazon Redshift clusters against common data, and isolate jobs with tight SLAs from ad hoc analysis.
  • 52. Building a Data Strategy on AWS: architecture diagram showing Amazon Kinesis Firehose, Athena Query Service, AWS Glue, batch processing and Amazon Redshift Spectrum as numbered components of the pipeline.
  • 54. 4 types of partners • Load and transform your data with Data Integration Partners • Analyze data and share insights across your organization with Business Intelligence Partners • Architect and implement your analytics platform with System Integration and Consulting Partners • Query, explore and model your data using tools and utilities from Query and Data Modeling Partners aws.amazon.com/redshift/partners/
  • 56. Blog posts:
1. Analyze Database Audit Logs for Security and Compliance Using Amazon Redshift Spectrum - amzn.to/2tlylga
2. Build a Healthcare Data Warehouse Using Amazon EMR, Amazon Redshift, AWS Lambda, and OMOP - amzn.to/2srIL1g
3. Run Mixed Workloads with Amazon Redshift Workload Management - amzn.to/2rgR8Z7
4. Converging Data Silos to Amazon Redshift Using AWS DMS - amzn.to/2lr66MH
5. Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning - amzn.to/2kIr1bq
6. Using pgpool and Amazon ElastiCache for Query Caching with Amazon Redshift - amzn.to/2rr7LWq
7. Extending Seven Bridges Genomics with Amazon Redshift and R - amzn.to/2szR3nf
8. Zero Admin Lambda based Redshift Loader - bit.ly/2swvvI6
Agenda recap: Architecture, Tuning, Integration, Spectrum, Ecosystem, OLX, Summary. Fast, Compatible, Secure, Elastic, Simple, Cost Efficient.