Amazon Redshift
Fast, simple, petabyte-scale data warehousing for $1,000/TB/Year
150+ features
Pratim Das
Specialist Solutions Architect, Data & Analytics,
EMEA
Redshift Architecture
Managed Massively Parallel Petabyte Scale
Data Warehouse
Streaming Backup/Restore to S3
Load data from S3, DynamoDB and EMR
Extensive Security Features
Scale from 160 GB -> 2 PB Online
Fast
Compatible
Secure
Elastic
Simple
Cost Efficient
Amazon Redshift Cluster Architecture
Massively parallel, shared nothing
Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL processing
Compute nodes
• Local, columnar storage
• Executes queries in parallel
• Load, backup, restore
• 2, 16 or 32 slices
[Architecture diagram: SQL clients/BI tools connect via JDBC/ODBC to the leader node; the leader node coordinates the compute nodes (each with 128 GB RAM, 16 TB disk, 16 cores) over a 10 GigE (HPC) interconnect; ingestion, backup and restore run against S3, EMR, DynamoDB and SSH.]
Use Case: Traditional Data Warehousing
Business
Reporting
Advanced pipelines
and queries
Secure and
Compliant
Easy Migration – Point & Click using AWS Database Migration Service
Secure & Compliant – End-to-End Encryption. SOC 1/2/3, PCI-DSS, HIPAA and FedRAMP compliant
Large Ecosystem – Variety of cloud and on-premises BI and ETL tools
Japanese Mobile
Phone Provider
Powering 100 marketplaces
in 50 countries
World’s Largest Children’s
Book Publisher
Bulk Loads
and Updates
Use Case: Log Analysis
Log & Machine
IoT Data
Clickstream
Events Data
Time-Series
Data
Cheap – Analyze large volumes of data cost-effectively
Fast – Massively Parallel Processing (MPP) and columnar architecture for fast queries and parallel loads
Near real-time – Micro-batch loading and Amazon Kinesis Firehose for near-real time analytics
Interactive data analysis and
recommendation engine
Ride analytics for pricing
and product development
Ad prediction and
on-demand analytics
Use Case: Business Applications
Multi-Tenant BI
Applications
Back-end
services
Analytics as a
Service
Fully Managed – Provisioning, backups, upgrades, security, compression all come built-in so you can
focus on your business applications
Ease of Chargeback – Pay as you go, add clusters as needed. A few big common clusters, several
data marts
Service Oriented Architecture – Integrated with other AWS services. Easy to plug into your pipeline
Infosys Information
Platform (IIP)
Analytics-as-a-Service
Product and Consumer
Analytics
Effective use of MPP
architecture
Design for Queryability
• Equally on each slice
• Minimum amount of work
• Use just enough cluster resources
Do an Equal Amount of Work
on Each Slice
Choose Best Table Distribution Style
All: all data on every node
Key: same key to same location
Even: round-robin distribution across slices
[Diagram: each style illustrated across two nodes with two slices per node]
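The three styles above are chosen per table with the DISTSTYLE clause. A minimal sketch (table and column names are hypothetical):

```sql
-- Small dimension table: copied to every node so joins never move it
CREATE TABLE dim_calendar (dateid int, caldate date) DISTSTYLE ALL;

-- Large fact table: co-locate rows that join on the same key
CREATE TABLE sales (dateid int, listid int, qtysold int)
DISTSTYLE KEY DISTKEY (listid);

-- No dominant join key: spread rows round-robin across slices
CREATE TABLE staging_events (raw varchar(max)) DISTSTYLE EVEN;
```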
Do the Minimum Amount of
Work on Each Slice
Columnar storage
+
Large data block sizes
+
Data compression
+
Zone maps
+
Direct-attached storage
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
[Zone map illustration: each block stores the min and max of its values (e.g. blocks spanning 10–324, 375–623 and 637–959), so blocks whose range cannot match a predicate are skipped.]
Reduced I/O = Enhanced Performance
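Zone maps pay off most when data is sorted, since sorting keeps each block's min/max range narrow. A sketch using the `listing` columns shown above (encodings taken from the `analyze compression` output; the query is hypothetical):

```sql
-- Sorting by listtime keeps each 1 MB block's zone map narrow
CREATE TABLE listing (
  listid   int       ENCODE delta,
  sellerid int       ENCODE delta32k,
  listtime timestamp ENCODE raw
)
SORTKEY (listtime);

-- Only blocks whose min/max range overlaps January are read
SELECT COUNT(*)
FROM listing
WHERE listtime BETWEEN '2017-01-01' AND '2017-01-31';
```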
Use Cluster Resources
Efficiently to Complete as
Quickly as Possible
Amazon Redshift Workload Management
[Diagram: BI tools, SQL clients and analytics tools submit queries; WLM holds waiting queries and runs admitted queries in queue slots.]
Example: a Queries queue with 80% memory and 4 slots (80/4 = 20% per slot), and an ETL queue with 20% memory and 2 slots (20/2 = 10% per slot).
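A session can temporarily claim more than one slot in its queue so a heavy statement gets more memory. A sketch, assuming a query group named 'etl' is routed to the ETL queue:

```sql
-- Route this session to the ETL queue
SET query_group TO 'etl';

-- Claim both ETL slots: this statement now gets the queue's
-- full 20% of memory instead of 10%
SET wlm_query_slot_count TO 2;
VACUUM sales;
SET wlm_query_slot_count TO 1;
```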
Query monitoring rules
Common use cases:
• Protect interactive queues
INTERACTIVE = { "query_execution_time > 15 sec" or
"query_cpu_time > 1500 uSec" or
"query_blocks_read > 18000 blocks" } [HOP]
• Monitor ad-hoc queues for heavy queries
AD-HOC = { "query_execution_time > 120" or
"query_cpu_time > 3000" or
"query_blocks_read > 180000" or
"memory_to_disk > 400000000000" } [LOG]
• Limit the number of rows returned to a client
MAXLINES = { "RETURN_ROWS > 50000" } [ABORT]
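Rule actions (log, hop, abort) are recorded in a system table, so you can audit what fired:

```sql
-- Which queries triggered a monitoring rule, and what action was taken
SELECT query, rule, action, recordtime
FROM stl_wlm_rule_action
ORDER BY recordtime DESC
LIMIT 20;
```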
Redshift Performance Tuning
Redshift Playbook
Part 1: Preamble, Prerequisites, and
Prioritization
Part 2: Distribution Styles and
Distribution Keys
Part 3: Compound and Interleaved
Sort Keys
Part 4: Compression Encodings
Part 5: Table Data Durability
amzn.to/2quChdM
Optimizing Amazon Redshift by Using the AWS
Schema Conversion Tool
amzn.to/2sTYow1
Ingestion, ETL & BI
Getting data to Redshift using AWS DMS
Simple to use · Minimal downtime · Supports most widely used databases
Low cost · Fast & easy to set up · Reliable
Loading data from S3
• Splitting Your Data into Multiple Files
• Uploading Files to Amazon S3
• Using the COPY Command to Load from
Amazon S3
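The three steps above come together in a single COPY that loads all matching file parts in parallel (bucket, prefix and IAM role below are hypothetical):

```sql
-- Files split as s3://mybucket/data/listing_part_00, _01, ...
-- load in parallel, spread across the cluster's slices
COPY listing
FROM 's3://mybucket/data/listing_part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
DELIMITER '|' GZIP
REGION 'eu-west-1';
```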
ETL on Redshift
QuickSight for BI on Redshift
Amazon Redshift
Querying Redshift with R Packages
• RJDBC – supports SQL queries
• dplyr – uses R code for data analysis
• RPostgreSQL – an R driver compliant with the Database Interface (DBI)
[Diagram: an R user in RStudio on Amazon EC2 queries unstructured data in Amazon S3, user profiles in Amazon RDS, and Amazon Redshift.]
Connecting R with Amazon Redshift blog post: https://blogs.aws.amazon.com/bigdata/post/Tx1G8828SPGX3PK/Connecting-R-with-Amazon-Redshift
Statistical UDF Example
CREATE FUNCTION f_z_test_by_pval (alpha float,
x_bar float, test_val float, sigma float, n float)
RETURNS varchar
STABLE AS $$
    import math
    import scipy.stats as st
    z = (x_bar - test_val) / (sigma / math.sqrt(n))
    p = st.norm.cdf(z)
    if p <= alpha:
        return 'Statistically significant'
    else:
        return 'May have occurred by random chance'
$$ LANGUAGE plpythonu;
Introduction to Python UDFs in Amazon Redshift: https://blogs.aws.amazon.com/bigdata/post/Tx1IHV1G67CY53T/Introduction-to-Python-UDFs-in-Amazon-Redshift
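Once created, the UDF is called like any scalar SQL function. A hypothetical invocation, testing a sample mean of 105 against a hypothesized mean of 100 (alpha = 0.05, sigma = 15, n = 30):

```sql
-- Arguments: alpha, sample mean, hypothesized mean, population sigma, n
SELECT f_z_test_by_pval(0.05, 105.0, 100.0, 15.0, 30.0);
```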
Application Developers can build smart
applications using Amazon Machine
Learning
Structured Data/Predictions
Amazon Redshift
Generate/Query
Predictions
Amazon QuickSight
Application
Amazon Machine
Learning
Visualize
• All skill levels
• Machine Learning technology is accessed through APIs / SDKs
• Embed visualizations in applications
Structured data
In Amazon Redshift
Load predictions into
Amazon Redshift
-or-
Read prediction results
directly from Amazon S3
Predictions
in Amazon S3
Query for predictions with
Amazon ML batch API
Your application
Batch predictions with Amazon Redshift
New Features Since Last
Meetup
https://aws.amazon.com/redshift/whats-new/
AWS Schema Conversion Tool Exports from
SQL Server, Oracle and Teradata to Amazon
Redshift
AWS Schema Conversion Tool (SCT) can now extract data
from a Microsoft SQL Server data warehouse for direct
import into Amazon Redshift. This follows the recently
announced capability to convert SQL Server data
warehouse schemas.
AWS Schema Conversion Tool (SCT) can now also extract
data from Teradata and Oracle data warehouses for direct
import into Amazon Redshift.
Encrypting unloaded data using Amazon S3
server-side encryption with AWS KMS keys
The Amazon Redshift UNLOAD command now supports Amazon S3
server-side encryption using an AWS KMS key. The UNLOAD
command unloads the results of a query to one or more files on
Amazon S3. You can let Amazon Redshift automatically encrypt your
data files using Amazon S3 server-side encryption, or you can specify
a symmetric encryption key that you manage. With this release, you
can use Amazon S3 server-side encryption with a key managed by
AWS KMS. In addition, the COPY command loads Amazon S3 server-side encrypted data files without requiring you to provide the key.
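An UNLOAD using SSE-KMS might look like this (the role ARN, key ID, bucket and table are hypothetical):

```sql
UNLOAD ('SELECT * FROM sales WHERE dateid >= 2000')
TO 's3://mybucket/unload/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
KMS_KEY_ID '1234abcd-12ab-34cd-56ef-1234567890ab'
ENCRYPTED;
```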
Zstandard for high data compression encoding
Amazon Redshift now supports Zstandard (ZSTD) column
compression encoding, which delivers better data
compression thereby reducing the amount of storage and
I/O needed. With the addition of ZSTD, Amazon Redshift
now offers seven compression encodings to choose from
depending on your dataset.
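ZSTD is applied per column, either declared at table creation or suggested by ANALYZE COMPRESSION. A sketch with hypothetical columns:

```sql
-- ZSTD compresses well on varchar columns with repetitive content
CREATE TABLE events (
  eventid   int          ENCODE delta,
  eventname varchar(200) ENCODE zstd,
  detail    varchar(max) ENCODE zstd
);
```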
Query monitoring rules (QMR)
You can use the query monitoring rules feature to set metrics-based
performance boundaries for workload management (WLM) queues,
and specify what action to take when a query goes beyond those
boundaries.
For example, for a queue that’s dedicated to short running queries, you
might create a rule that aborts queries that run for more than 60
seconds. To track poorly designed queries, you might have another rule
that logs queries that contain nested loops. We also provide predefined
rule templates in the Amazon Redshift management console to get you started.
New Functions
STV_QUERY_METRICS displays the metrics for currently running
queries and STL_QUERY_METRICS records the metrics for
completed queries.
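A sketch of inspecting query-level metrics for completed queries (check the column list against the system-table docs for your cluster version):

```sql
-- Rows with segment = -1 hold metrics rolled up for the whole query
SELECT query, rows, cpu_time, run_time
FROM stl_query_metrics
WHERE segment = -1
ORDER BY run_time DESC
LIMIT 20;
```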
The new APPROXIMATE PERCENTILE_DISC function returns the
value in a list that's closest to a given percentile. Approximation
enables the function to execute much faster, with a relative error of
around 0.5 percent.
Previously available as window functions, PERCENTILE_CONT and
MEDIAN are now also available as aggregate functions.
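Usage sketch contrasting the exact and approximate forms as aggregates (table and column are from the earlier `listing` example):

```sql
-- Exact median, computed as an aggregate
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY priceperticket)
FROM listing;

-- Approximate: much faster on large tables, ~0.5% relative error
SELECT APPROXIMATE PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY priceperticket)
FROM listing;
```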
Support for Python UDF logging module
You can now use the standard Python logging module to
log error and warning messages from Amazon Redshift
user-defined functions (UDFs). You can then query the
SVL_UDF_LOG system view to retrieve the messages
logged from your UDFs and troubleshoot them easily.
For more information and examples, see Logging Errors
and Warnings in UDFs in the Amazon Redshift Database
Developer Guide.
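A sketch of a UDF that logs a warning, and the follow-up query against SVL_UDF_LOG (function and column names here are hypothetical):

```sql
CREATE FUNCTION f_parse_qty (s varchar)
RETURNS int
STABLE AS $$
    import logging
    try:
        return int(s)
    except (TypeError, ValueError):
        logging.warning('could not parse quantity: %s' % s)
        return None
$$ LANGUAGE plpythonu;

-- Retrieve the logged messages afterwards
SELECT funcname, message, created
FROM svl_udf_log
ORDER BY created DESC
LIMIT 20;
```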
Record and govern Amazon Redshift
configurations with AWS Config
You can now record configuration changes to your Amazon Redshift
clusters with AWS Config. The detailed configuration recorded by AWS
Config includes changes made to Amazon Redshift clusters, cluster
parameter groups, cluster security groups, cluster snapshots, cluster
subnet groups, and event subscriptions. In addition, you can run two
new managed Config Rules to check whether your Amazon Redshift
clusters have the appropriate configuration and maintenance settings.
These checks include verifying that your cluster database is encrypted,
logging is enabled, snapshot data retention period is set appropriately,
and much more.
Kinesis Firehose can now prepare and
transform streaming data before loading it to
data stores
You can now configure Amazon Kinesis Firehose to prepare your
streaming data before it is loaded to data stores. With this new feature,
you can easily convert raw streaming data from your data sources into
formats required by your destination data stores, without having to
build your own data processing pipelines.
To use this feature, simply select an AWS Lambda function from the
Amazon Kinesis Firehose delivery stream configuration tab in the AWS
Management console. Amazon Kinesis Firehose will automatically
apply that function to every input data record and load the transformed
data to destinations.
Open Source: Amazon Kinesis Data Generator
(KDG)
The Amazon Kinesis Data Generator
is a UI that simplifies how you send
test data to Amazon Kinesis Streams
or Amazon Kinesis Firehose. Using
the Amazon Kinesis Data Generator,
you can create templates for your
data, create random values to use for
your data, and save the templates for
future use. bit.ly/2tgQVJc
Amazon Redshift Spectrum
Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
Fast @ exabyte scale · Elastic & highly available · On-demand, pay-per-query
High concurrency: Multiple
clusters access same data
No ETL: Query data in-place
using open file formats
Full Amazon Redshift
SQL support
Example SQL query against data in S3:
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY …
Life of a query
[Diagram: a SQL client connects via JDBC/ODBC to the Amazon Redshift cluster, which fans out to Amazon Redshift Spectrum nodes 1…N; data lives in Amazon S3 (exabyte-scale object storage), with partition metadata in the Data Catalog / Apache Hive Metastore.]
1. Query is issued from the SQL client over JDBC/ODBC
2. Query is optimized and compiled at the leader node, which determines what gets run locally and what goes to Amazon Redshift Spectrum
3. Query plan is sent to all compute nodes
4. Compute nodes obtain partition info from the Data Catalog and dynamically prune partitions
5. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer
6. Amazon Redshift Spectrum nodes scan your S3 data
7. Amazon Redshift Spectrum projects, filters, joins and aggregates
8. Final aggregations and joins with local Amazon Redshift tables are done in-cluster
9. Result is sent back to the client
Amazon Redshift Spectrum – Current support
File formats
• Parquet
• CSV
• Sequence
• RCFile
• ORC (coming soon)
• RegExSerDe (coming soon)
Compression
• Gzip
• Snappy
• Lzo (coming soon)
• Bz2
Encryption
• SSE with AES256
• SSE KMS with default
key
Column types
• Numeric: bigint, int, smallint, float, double
and decimal
• Char/varchar/string
• Timestamp
• Boolean
• DATE type can be used only as a
partitioning key
Table type
• Non-partitioned table
(s3://mybucket/orders/..)
• Partitioned table
(s3://mybucket/orders/date=YYYY-MM-DD/..)
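Spectrum tables live in an external schema backed by the Data Catalog. A sketch of both table types above (database name, role ARN and bucket are hypothetical):

```sql
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Partitioned external table; note DATE is used only as the partition key
CREATE EXTERNAL TABLE spectrum.orders (
  orderid bigint,
  total   decimal(12,2)
)
PARTITIONED BY (orderdate date)
STORED AS PARQUET
LOCATION 's3://mybucket/orders/';
```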
Is Amazon Redshift Spectrum useful if I don’t have an exabyte?
Your data will get bigger
On average, data warehousing volumes grow 10x every 5 years
The average Amazon Redshift customer doubles data each year
Amazon Redshift Spectrum makes data analysis simpler
Access your data without ETL pipelines
Teams using Amazon EMR, Athena & Redshift can collaborate using the same data lake
Amazon Redshift Spectrum improves availability and concurrency
Run multiple Amazon Redshift clusters against common data
Isolate jobs with tight SLAs from ad hoc analysis
Building a Data Strategy on AWS
[Architecture diagram: a numbered pipeline (steps 1–10) combining Kinesis Firehose for ingestion, the Athena query service, Glue, and batch processing.]
Redshift Partner Ecosystem
4 types of partners
• Load and transform your data with Data Integration
Partners
• Analyze data and share insights across your
organization with Business Intelligence Partners
• Architect and implement your analytics platform
with System Integration and Consulting Partners
• Query, explore and model your data using tools and
utilities from Query and Data Modeling Partners
aws.amazon.com/redshift/partners/
“Some” Amazon Redshift Customers
1. Analyze Database Audit Logs for Security and
Compliance Using Amazon Redshift Spectrum
2. Build a Healthcare Data Warehouse Using Amazon EMR,
Amazon Redshift, AWS Lambda, and OMOP
3. Run Mixed Workloads with Amazon Redshift Workload
Management
4. Converging Data Silos to Amazon Redshift Using AWS
DMS
5. Powering Amazon Redshift Analytics with Apache Spark
and Amazon Machine Learning
6. Using pgpool and Amazon ElastiCache for Query Caching
with Amazon Redshift
7. Extending Seven Bridges Genomics with Amazon Redshift
and R
8. Zero Admin Lambda based Redshift Loader
Architecture · Tuning · Integration · Spectrum · Ecosystem
OLX Summary
Fast
Compatible
Secure
Elastic
Simple
Cost Efficient
amzn.to/2tlylga
amzn.to/2srIL1g
amzn.to/2rgR8Z7
amzn.to/2lr66MH
amzn.to/2kIr1bq
amzn.to/2rr7LWq
amzn.to/2szR3nf
bit.ly/2swvvI6
We are hiring!!!
www.amazon.jobs
Thank You
Data is magic!
London Redshift Meetup - July 2017

  • 1. Amazon Redshift shift Fast, simple, petabyte-scale data warehousing for $1,000/TB/Year 150+ features Pratim Das Specialist Solutions Architect, Data & Analytics, EMEA
  • 3. Managed Massively Parallel Petabyte Scale Data Warehouse Streaming Backup/Restore to S3 Load data from S3, DynamoDB and EMR Extensive Security Features Scale from 160 GB -> 2 PB Online Fast CompatibleSecure ElasticSimple Cost Efficient
  • 4. Amazon Redshift Cluster Architecture Massively parallel, shared nothing Leader node • SQL endpoint • Stores metadata • Coordinates parallel SQL processing Compute nodes • Local, columnar storage • Executes queries in parallel • Load, backup, restore • 2, 16 or 32 slices 10 GigE (HPC) Ingestion Backup Restore SQL Clients/BI Tools 128GB RAM 16TB disk 16 cores S3 / EMR / DynamoDB / SSH JDBC/ODBC 128GB RAM 16TB disk 16 coresCompute Node 128GB RAM 16TB disk 16 coresCompute Node 128GB RAM 16TB disk 16 coresCompute Node Leader Node
  • 5. Use Case: Traditional Data Warehousing Business Reporting Advanced pipelines and queries Secure and Compliant Easy Migration – Point & Click using AWS Database Migration Service Secure & Compliant – End-to-End Encryption. SOC 1/2/3, PCI-DSS, HIPAA and FedRAMP compliant Large Ecosystem – Variety of cloud and on-premises BI and ETL tools Japanese Mobile Phone Provider Powering 100 marketplaces in 50 countries World’s Largest Children’s Book Publisher Bulk Loads and Updates
  • 6. Use Case: Log Analysis Log & Machine IOT Data Clickstream Events Data Time-Series Data Cheap – Analyze large volumes of data cost-effectively Fast – Massively Parallel Processing (MPP) and columnar architecture for fast queries and parallel loads Near real-time – Micro-batch loading and Amazon Kinesis Firehose for near-real time analytics Interactive data analysis and recommendation engine Ride analytics for pricing and product development Ad prediction and on-demand analytics
  • 7. Use Case: Business Applications Multi-Tenant BI Applications Back-end services Analytics as a Service Fully Managed – Provisioning, backups, upgrades, security, compression all come built-in so you can focus on your business applications Ease of Chargeback – Pay as you go, add clusters as needed. A few big common clusters, several data marts Service Oriented Architecture – Integrated with other AWS services. Easy to plug into your pipeline Infosys Information Platform (IIP) Analytics-as-a- Service Product and Consumer Analytics
  • 8. Effective use of MPP architecture
  • 9. Design for Queryability • Equally on each slice • Minimum amount of work • Use just enough cluster resources
  • 10. Do an Equal Amount of Work on Each Slice
  • 11. Choose Best Table Distribution Style All Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 All data on every node Key Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 Same key to same location Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 Even Round robin distribution
  • 12. Do the Minimum Amount of Work on Each Slice
  • 13. Columnar storage + Large data block sizes + Data compression + Zone maps + Direct-attached storage analyze compression listing; Table | Column | Encoding ---------+----------------+---------- listing | listid | delta listing | sellerid | delta32k listing | eventid | delta32k listing | dateid | bytedict listing | numtickets | bytedict listing | priceperticket | delta32k listing | totalprice | mostly32 listing | listtime | raw 10 | 13 | 14 | 26 |… … | 100 | 245 | 324 375 | 393 | 417… … 512 | 549 | 623 637 | 712 | 809 … … | 834 | 921 | 959 10 324 375 623 637 959 Reduced I/O = Enhanced Performance
  • 14. Use Cluster Resources Efficiently to Complete as Quickly as Possible
  • 15. Amazon Redshift Workload Management Waiting Workload Management BI tools SQL clients Analytics tools Client Running Queries: 80% memory ETL: 20% memory 4 Slots 2 Slots 80/4 = 20% per slot 20/2 = 10% per slot
  • 16. Query monitoring rules Common use cases: • Protect interactive queues INTERACTIVE = { “query_execution_time > 15 sec” or “query_cpu_time > 1500 uSec” or ”query_blocks_read > 18000 blocks” } [HOP] • Monitor ad-hoc queues for heavy queries AD-HOC = { “query_execution_time > 120” or “query_cpu_time > 3000” or ”query_blocks_read > 180000” or “memory_to_disk > 400000000000”} [LOG] • Limit the number of rows returned to a client MAXLINES = { “RETURN_ROWS > 50000” } [ABORT]
  • 18. Redshift Playbook Part 1: Preamble, Prerequisites, and Prioritization Part 2: Distribution Styles and Distribution Keys Part 3: Compound and Interleaved Sort Keys Part 4: Compression Encodings Part 5: Table Data Durability amzn.to/2quChdM
  • 19. Optimizing Amazon Redshift by Using the AWS Schema Conversion Tool amzn.to/2sTYow1
  • 21. Getting data to Redshift using AWS DMS Simple to use Minimal Downtime Supports most widely used Databases Low Cost Fast & Easy to Set-up Reliable
  • 22. Loading data from S3 • Splitting Your Data into Multiple Files • Uploading Files to Amazon S3 • Using the COPY Command to Load from Amazon S3
  • 24. QuickSight for BI on Redshift Amazon Redshift
  • 25. Querying Redshift with R Packages • RJDBC – supports SQL queries • dplyr – Uses R code for data analysis • RPostgreSQL - R compliant driver or Database Interface (DBI) R User R Studio Amazon EC2 Unstructured Data Amazon S3 User Profile Amazon RDS Amazon Redshift Connecting R with Amazon Redshift blog post: https://blogs.aws.amazon.com/bigdata/post/Tx1G8828SPGX3PK/Connecting-R-with-Amazon-Redshift
  • 26. Statistical UDF Example CREATE FUNCTION f_z_test_by_pval (alpha float, x_bar float, test_val float, sigma float, n float) RETURNS varchar STABLE AS $$ import scipy.stats as st import math as math z = (x_bar - test_val) / (sigma / math.sqrt(n)) p = st.norm.cdf(z) if p <= alpha: return 'Statistically significant' else: return 'May have occurred by random chance' $$ LANGUAGE plpythonu; Introduction to Python UDFs in Amazon Redshift: https://blogs.aws.amazon.com/bigdata/post/Tx1IHV1G67CY53T/Introduction-to-Python- UDFs-in-Amazon-Redshift
  • 27. Application Developers can build smart applications using Amazon Machine Learning Structured Data/Predictions Amazon Redshift Generate/Query Predictions Amazon QuickSight Application Amazon Machine Learning Visualize • All skill levels • Machine Learning technology is accessed through APIs / SDKs • Embed visualizations in applications
  • 28. Structured data In Amazon Redshift Load predictions into Amazon Redshift -or- Read prediction results directly from Amazon S3 Predictions in Amazon S3 Query for predictions with Amazon ML batch API Your application Batch predictions with Amazon Redshift
  • 29. New Features Since Last Meetup https://aws.amazon.com/redshift/whats-new/
  • 30. AWS Schema Conversion Tool Exports from SQL Server, Oracle and Teradata to Amazon Redshift AWS Schema Conversion Tool (SCT) can now extract data from a Microsoft SQL Server data warehouse for direct import into Amazon Redshift. This follows the recently announced capability to convert SQL Server data warehouse schemas. AWS Schema Conversion Tool (SCT) can now also extract data from Teradata and Oracle data warehouses for direct import into Amazon Redshift.
  • 31. Encrypting unloaded data using Amazon S3 server-side encryption with AWS KMS keys The Amazon Redshift UNLOAD command now supports Amazon S3 server-side encryption using an AWS KMS key. The UNLOAD command unloads the results of a query to one or more files on Amazon S3. You can let Amazon Redshift automatically encrypt your data files using Amazon S3 server-side encryption, or you can specify a symmetric encryption key that you manage. With this release, you can use Amazon S3 server-side encryption with a key managed by AWS KMS. In addition, the COPY command loads Amazon S3 server- side encrypted data files without requiring you to provide the key.
  • 32. Zstandard for high data compression encoding Amazon Redshift now supports Zstandard (ZSTD) column compression encoding, which delivers better data compression thereby reducing the amount of storage and I/O needed. With the addition of ZSTD, Amazon Redshift now offers seven compression encodings to choose from depending on your dataset.
  • 33. Query monitoring rules (QMR). You can use the query monitoring rules feature to set metrics-based performance boundaries for workload management (WLM) queues, and specify what action to take when a query goes beyond those boundaries. For example, for a queue that’s dedicated to short-running queries, you might create a rule that aborts queries that run for more than 60 seconds. To track poorly designed queries, you might have another rule that logs queries that contain nested loops. We also provide predefined rule templates in the Amazon Redshift management console to get you started.
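The 60-second abort rule and the nested-loop logging rule described above can be sketched as WLM configuration JSON. This is a sketch only: field and metric names should be checked against the WLM JSON configuration reference, and the values are illustrative:

```python
import json

# Sketch of a WLM queue definition with two query monitoring rules, mirroring
# the examples on the slide: abort queries running over 60 s, and log queries
# that contain nested loops. Field/metric names follow the WLM JSON format;
# concurrency and thresholds here are illustrative.
wlm_queue = {
    "query_concurrency": 5,
    "rules": [
        {
            "rule_name": "abort_long_running",
            "predicate": [
                {"metric_name": "query_execution_time", "operator": ">", "value": 60}
            ],
            "action": "abort",
        },
        {
            "rule_name": "log_nested_loops",
            "predicate": [
                {"metric_name": "nested_loop_join_row_count", "operator": ">", "value": 0}
            ],
            "action": "log",
        },
    ],
}

print(json.dumps(wlm_queue, indent=2))
```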
  • 34. New Functions STV_QUERY_METRICS displays the metrics for currently running queries and STL_QUERY_METRICS records the metrics for completed queries. The new APPROXIMATE PERCENTILE_DISC function returns the value in a list that's closest to a given percentile. Approximation enables the function to execute much faster, with a relative error of around 0.5 percent. Previously available as window functions, PERCENTILE_CONT and MEDIAN are now also available as aggregate functions.
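The discrete-percentile semantics (returning an actual value from the list rather than an interpolated one) can be illustrated in plain Python. A sketch of the standard PERCENTILE_DISC definition, not Redshift's implementation:

```python
import math

def percentile_disc(values, p):
    # Return an element of the list: the first value in sorted order whose
    # cumulative fraction of rows is >= p (standard PERCENTILE_DISC semantics).
    ordered = sorted(values)
    idx = math.ceil(p * len(ordered)) - 1
    return ordered[max(idx, 0)]

data = [15, 20, 35, 40, 50]
print(percentile_disc(data, 0.5))   # the median element of the list
```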
  • 35. Support for the Python UDF logging module. You can now use the standard Python logging module to log error and warning messages from Amazon Redshift user-defined functions (UDFs). You can then query the SVL_UDF_LOG system view to retrieve the messages logged from your UDFs and troubleshoot them easily. For more information and examples, see Logging Errors and Warnings in UDFs in the Amazon Redshift Database Developer Guide.
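A UDF body using the logging module might look like the following sketch; inside Redshift the messages would surface in SVL_UDF_LOG, while run locally they go to the standard Python logger (the function name and logic are illustrative):

```python
import logging

# Illustrative UDF-style function that logs a warning instead of failing.
logger = logging.getLogger("udf_example")

def f_safe_divide(a, b):
    if b == 0:
        logger.warning("division by zero; returning None")
        return None
    return a / b

print(f_safe_divide(10.0, 4.0))
print(f_safe_divide(1.0, 0.0))   # emits a warning, returns None
```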
  • 36. Record and govern Amazon Redshift configurations with AWS Config You can now record configuration changes to your Amazon Redshift clusters with AWS Config. The detailed configuration recorded by AWS Config includes changes made to Amazon Redshift clusters, cluster parameter groups, cluster security groups, cluster snapshots, cluster subnet groups, and event subscriptions. In addition, you can run two new managed Config Rules to check whether your Amazon Redshift clusters have the appropriate configuration and maintenance settings. These checks include verifying that your cluster database is encrypted, logging is enabled, snapshot data retention period is set appropriately, and much more.
  • 37. Kinesis Firehose can now prepare and transform streaming data before loading it to data stores You can now configure Amazon Kinesis Firehose to prepare your streaming data before it is loaded to data stores. With this new feature, you can easily convert raw streaming data from your data sources into formats required by your destination data stores, without having to build your own data processing pipelines. To use this feature, simply select an AWS Lambda function from the Amazon Kinesis Firehose delivery stream configuration tab in the AWS Management console. Amazon Kinesis Firehose will automatically apply that function to every input data record and load the transformed data to destinations.
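The transformation function follows the Firehose record-transformation contract: each input record carries base64-encoded data, and the function returns each record with a result status and (possibly transformed) data. A minimal sketch, with an illustrative transformation and a fake local invocation:

```python
import base64
import json

def lambda_handler(event, context):
    # Firehose passes a batch of records; each must come back with the same
    # recordId, a result status ("Ok" here), and base64-encoded data.
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["source"] = "firehose"          # illustrative transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}

# Local invocation with a fake event
fake = {"records": [{"recordId": "1",
                     "data": base64.b64encode(b'{"user": "a"}').decode()}]}
print(lambda_handler(fake, None))
```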
  • 38. Open Source: Amazon Kinesis Data Generator (KDG) The Amazon Kinesis Data Generator is a UI that simplifies how you send test data to Amazon Kinesis Streams or Amazon Kinesis Firehose. Using the Amazon Kinesis Data Generator, you can create templates for your data, create random values to use for your data, and save the templates for future use. bit.ly/2tgQVJc
  • 40. Amazon Redshift Spectrum: run SQL queries directly against data in S3 using thousands of nodes. Fast at exabyte scale. Elastic and highly available. On-demand, pay-per-query. High concurrency: multiple clusters access the same data. No ETL: query data in place using open file formats. Full Amazon Redshift SQL support.
  • 41–49. Life of a query (Amazon Redshift cluster with compute nodes 1…N; Amazon S3 exabyte-scale object storage; Data Catalog / Apache Hive Metastore):
1. A query arrives over JDBC/ODBC, e.g. SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY…
2. The query is optimized and compiled at the leader node, which determines what runs locally and what goes to Amazon Redshift Spectrum.
3. The query plan is sent to all compute nodes.
4. Compute nodes obtain partition info from the Data Catalog and dynamically prune partitions.
5. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer.
6. Amazon Redshift Spectrum nodes scan your S3 data.
7. Amazon Redshift Spectrum projects, filters, joins and aggregates.
8. Final aggregations and joins with local Amazon Redshift tables are done in-cluster.
9. The result is sent back to the client.
  • 50. Amazon Redshift Spectrum – current support.
File formats: Parquet, CSV, Sequence, RCFile; ORC and RegExSerDe coming soon.
Compression: Gzip, Snappy, Bz2; Lzo coming soon.
Encryption: SSE with AES256; SSE-KMS with default key.
Column types: numeric (bigint, int, smallint, float, double and decimal), char/varchar/string, timestamp, boolean; the DATE type can be used only as a partitioning key.
Table types: non-partitioned tables (s3://mybucket/orders/..) and partitioned tables (s3://mybucket/orders/date=YYYY-MM-DD/..).
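The partitioned table layout above maps partition values to S3 prefixes. A small sketch generating such date prefixes (bucket and table names are illustrative):

```python
from datetime import date, timedelta

def partition_prefixes(bucket, table, start, days):
    # Build date-partitioned S3 prefixes of the form shown on the slide:
    # s3://mybucket/orders/date=YYYY-MM-DD/
    return ["s3://%s/%s/date=%s/" % (bucket, table,
                                     (start + timedelta(days=i)).isoformat())
            for i in range(days)]

print(partition_prefixes("mybucket", "orders", date(2017, 6, 1), 3))
```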
  • 51. Is Amazon Redshift Spectrum useful if I don’t have an exabyte?
Your data will get bigger: on average, data warehousing volumes grow 10x every 5 years, and the average Amazon Redshift customer doubles data each year.
Amazon Redshift Spectrum makes data analysis simpler: access your data without ETL pipelines; teams using Amazon EMR, Athena and Redshift can collaborate using the same data lake.
Amazon Redshift Spectrum improves availability and concurrency: run multiple Amazon Redshift clusters against common data, and isolate jobs with tight SLAs from ad hoc analysis.
  • 52. Building a Data Strategy on AWS: architecture diagram showing Amazon Kinesis Firehose, Athena Query Service, AWS Glue, batch processing and Amazon Redshift Spectrum as numbered components of the pipeline.
  • 54. 4 types of partners • Load and transform your data with Data Integration Partners • Analyze data and share insights across your organization with Business Intelligence Partners • Architect and implement your analytics platform with System Integration and Consulting Partners • Query, explore and model your data using tools and utilities from Query and Data Modeling Partners aws.amazon.com/redshift/partners/
  • 56. Blog posts:
1. Analyze Database Audit Logs for Security and Compliance Using Amazon Redshift Spectrum - amzn.to/2tlylga
2. Build a Healthcare Data Warehouse Using Amazon EMR, Amazon Redshift, AWS Lambda, and OMOP - amzn.to/2srIL1g
3. Run Mixed Workloads with Amazon Redshift Workload Management - amzn.to/2rgR8Z7
4. Converging Data Silos to Amazon Redshift Using AWS DMS - amzn.to/2lr66MH
5. Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning - amzn.to/2kIr1bq
6. Using pgpool and Amazon ElastiCache for Query Caching with Amazon Redshift - amzn.to/2rr7LWq
7. Extending Seven Bridges Genomics with Amazon Redshift and R - amzn.to/2szR3nf
8. Zero Admin Lambda based Redshift Loader - bit.ly/2swvvI6
Agenda recap: Architecture, Tuning, Integration, Spectrum, Ecosystem, OLX, Summary. Fast, Compatible, Secure, Elastic, Simple, Cost Efficient.