Serverless SQL

Torsten Steinbach
@torsstei
IBM

SQL on Object
Storage
DM Gartner
Hype Cycle
2018

Evolution of Form Factors
For Big Data Analytics
Enterprise Data
Warehouses
Tightly integrated and
optimized systems
Hadoop
Introduced open data formats & easy
scaling on commodity HW
Cloud-Native:
Serverless Analytics-aaS
• Seamless elasticity
• Pay-per-query consumption
• Analyze data as it sits in an object store
• Disaggregated architecture
• No more infrastructure headaches
The 1990s | The 2000s | Today

Ingredient 3: Serverless Data Transformation
Ingredient 4: Serverless Analytics
Ingredient 5: Serverless Automation
Ingredient 2: Serverless Data Ingest
Sharing Economy for Analytics
Ingredient 1: Serverless Storage

Object Storage
IBM Cloud Object Storage
Objects
Objects
Objects
At Rest
On the Wire
Buckets
Encrypted
Pennies per GB
REST
Elastic
Durable
Flexible
Resiliency Choices
Storage Classes
User Managed
Encryption Keys
S3 Compatible
High Speed Data
Transfer
Aspera
SQL Queries

Data Ingest Options
High Customizability
Degree of Serverless-ness
IBM Event Streams
(Kafka aaS)
IBM Cloud Functions
Out-of-the-Box
IBM Streaming Analytics
(IBM Streams aaS)
via Cloud Object Storage API
SQL Query ETL
Cloudant Replication
Blockchain Synch

Cloud Data
Data
Transformation
Serverless SQL
Analytics
IBM SQL Query
Object
Storage
Db2
+
Developers
Data
Engineers
Data Analysts
✓ Perfect for Machine Generated Data
✓ Ad-hoc Data Exploration
✓ Operationalizing Data Pipelines
✓ Big Data Lakes
✓ Flexible Data Transformation
✓ Extremely affordable: $5/TB scanned
✓ 100% API enabled
✓ Analytics on Object Storage
✓ Big Data Scale-Out. Running on Spark
✓ 100% Self service – No Setup

2. Read data
4. Read
results
Application
3. Write results
IBM Cloud
Object Storage
Result Set
Data Set
Data Set
Data Set
1. Submit SQL
SQL
Archive / Export
IBM Cloud Streaming
IBM Streams
Event Streams
Land
Query
IBM Cloud Functions
IBM SQL Query
Architecture
IBM Cloud Databases
Db2 on Cloud
Geospatial SQL
Data Skipping
Timeseries SQL
Upload

Data Center 2
Analytics Engine Cluster
20 Kernels
Node 1
Node 3
Node 2
Node 3
…
20
Kernels
…
Data Center 3
20 Kernels
Node 1
Node 3
Node 2
Node 3
…
20
Kernels
…
SQL 1 SQL 1
Data Center 1
IBM Cloud SQL Query – Very High Level Architecture (MVP 1Q 2018)
20 Kernels
Cluster
Pool
Request Queue
Node 1
Node 3
Node 2
Node 3
…
Kernel
Pools
20
Kernels
…
SQL 1 SQL 2 SQL 3 SQL 4 SQL 5
Cloud Object Storage
SQL 6 …
JKG (Web Sockets)
IBM Cloud Query – Spark Cluster Architecture

SQL REST API
Create
Query
SQL Web Console
Watson
Studio
Notebooks
SQL Cloud Function
Integrate Explore
Deploy
IBM Cloud Query – Access Patterns
Node SDK
Python SDK
JDBC
Looker

Best of breed Spark SQL Reference
• Complete, intuitive and interactive SQL Reference
• Each sample SQL can immediately be executed as is
https://cloud.ibm.com/docs/services/sql-query/sqlref/sql_reference.html#sql-reference
Analytics using full Power of Spark SQL

IBM SQL Query – Timeseries SQL 1/2
§ Intuitive first-of-a-kind SQL extensions for timeseries operations
§ Industry leading differentiators, including:
• Timeseries transformation functions:
• Correlation, Fourier transformation,
z-normalization, Granger, interpolation,
and distances
• Temporal Joins: SQL support for
inner and left/right/full outer joins
of multiple timeseries
Alignment & Joining:
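To make two of the operations named above concrete, here is a minimal plain-Python sketch of z-normalization and of a left temporal join. It is not the SQL Query timeseries implementation: the function names are hypothetical, and the join assumes an "as-of" alignment in which each left observation is paired with the latest right observation at or before its timestamp.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def z_normalize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant series has no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def left_temporal_join(left, right):
    """Keep every (t, value) of `left` and attach the latest `right` value at or before t.

    Both inputs are lists of (timestamp, value) sorted by timestamp.
    Assumption: "as-of" alignment; unmatched left rows get None.
    """
    right_times = [t for t, _ in right]
    joined = []
    for t, v in left:
        i = bisect_right(right_times, t)
        match = right[i - 1][1] if i > 0 else None
        joined.append((t, v, match))
    return joined


# Hypothetical sensor readings as (timestamp, value) pairs.
temperature = [(0, 20.0), (10, 22.0), (20, 24.0)]
humidity = [(5, 0.40), (15, 0.45)]

joined = left_temporal_join(temperature, humidity)
normalized = z_normalize([v for _, v in temperature])
```

The first temperature reading has no earlier humidity reading, so it is kept with a missing partner, which is what distinguishes a left join from an inner join here.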

§ Further Industry leading differentiators
• Numerical and categorical timeseries types
• Timeseries data skipping for fast queries
• Forecasting:
• ARIMA, BATS, Anomaly detection, etc.
• Subsequence Mining:
• Train & match models for event sequences
• Segmentation:
• Time-based, Record-based, Anchor-based, Burst, and silence
Segmentation:
IBM SQL Query – Timeseries SQL 2/2
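As an illustration of two of the segmentation styles listed above, the following sketch splits a sorted list of timestamps by time window and by silence (gaps). The function names and the sample data are hypothetical, and the exact semantics of the SQL Query segmentation functions may differ.

```python
def segment_by_time(timestamps, window):
    """Time-based: group timestamps into fixed windows [k*window, (k+1)*window)."""
    buckets = {}
    for t in timestamps:
        buckets.setdefault(t // window, []).append(t)
    return [buckets[k] for k in sorted(buckets)]


def segment_by_silence(timestamps, max_gap):
    """Silence-based: start a new segment whenever the gap to the previous event exceeds max_gap."""
    segments = []
    current = []
    for t in timestamps:
        if current and t - current[-1] > max_gap:
            segments.append(current)
            current = []
        current.append(t)
    if current:
        segments.append(current)
    return segments


# Hypothetical event timestamps (sorted, in seconds).
events = [1, 2, 3, 10, 11, 30]

time_segments = segment_by_time(events, window=20)
silence_segments = segment_by_silence(events, max_gap=5)
```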

IBM SQL Query – Spatial SQL
§ SQL/MM standard to store & analyze spatial data in RDBMS
§ Migration of PostGIS compliant SQL queries
§ Aggregation, computation and join via native SQL syntax
§ Industry leading differentiators
• Geodetic Full Earth support
• Increased developer productivity
• Avoid piece-wise planar projections
• High precision calculations anywhere on the earth
• Very large polygons (e.g. countries), polar caps, crossing the anti-meridian
• Spatial data skipping for fast queries
• Native and fine-granular geohash support
• Fast spatial aggregation
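The geohash and spatial aggregation bullets can be illustrated with a short sketch of the standard public geohash encoding, followed by a count of points per cell. This is the generic algorithm, not the SQL Query spatial functions; the function name and the sample points are hypothetical.

```python
from collections import Counter

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=7):
    """Encode a latitude/longitude pair as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_lon = True  # bits alternate, starting with longitude
    while len(bits) < precision * 5:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):
        index = 0
        for bit in bits[i:i + 5]:
            index = (index << 1) | bit
        chars.append(BASE32[index])
    return "".join(chars)


# Hypothetical GPS points as (latitude, longitude).
points = [(57.64911, 10.40744), (57.65, 10.41), (40.7, -74.0)]

# Spatial aggregation: count points per coarse (3-character) geohash cell.
cell_counts = Counter(geohash_encode(lat, lon, precision=3) for lat, lon in points)
```

Shorter geohashes describe larger cells, so the precision parameter controls how fine-grained the aggregation is.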

Example: Spatio-Temporal Processing of Sensor Data
Sensor
Data
Query
Location
Analytics
Mobile
Cars
Devices
Land
Location
Filtering
Spatial
Aggregation
GPS
SQL/MM
Sensor
Metrics
t
t
t
Timeseries
Assembly
Timeseries
Join
Timeseries SQL
t

Serverless
Storage
Serverless
Runtimes
Serverless
Analytics
Object
Storage
Cloud
Functions
Query
A Completely Serverless Stack for Data & Analytics Solutions

Unstructured Data Prep
SQL Query
Cloud
Functions
Analyze
COS
COS
Extract Features
Automated/Scheduled SQL Execution
SQL Query
Cloud
Functions
Develop SQL Deploy as SQL Cloud Function
Set up Cloud
Function
Trigger/Schedule
Shield Data From Direct Access
SQL Query
Cloud
Functions
Deploy Cloud Function
with COS API Key
User Calls
Function to
Access Data
COS
Grant Execute on SQL
Cloud Function to User
Configure SQL Pipelines
SQL Query
Cloud
Functions
User creates function
sequence to automate flow
of consecutive SQLs
Sequence
SQL Query
Cloud
Functions
1.
2.
Use Cases of Cloud Functions Adding Value to SQL

Ingredient 3: Serverless Data Transformation ✓
Ingredient 4: Serverless Analytics ✓
Ingredient 5: Serverless Automation ✓
Ingredient 2: Serverless Data Ingest ✓
Ingredient 1: Serverless Storage ✓
Now, what is this all good for?

Acquire
Query
Data Warehouses &
Databases
Db2 on Cloud
Process Analyze
Applications
Applications
Applications
IoT
Streaming
Devices
Devices
Devices
BI & AI
Land
Log Messages
Cleanse
Filter
Merge
Aggregate
Compress
Watson Studio
Looker
Cognos
WML
Explore
Analyze Analyze
Promote
Use for Data Pipelines to fuel BI & AI

Data-Driven Decisions
☛ Understanding system health, user behavior & workload status
Collecting & Analyzing Log Data
☛ Is NOT an afterthought but rather the foundation for decisions on
system and feature design.
Data Volume Growing Rapidly
☛ Growth rates and data volume at rest can jump dramatically. Very
high elasticity is required.
Competitive Advantage
☛ Is based on short runways for turning data into actions
Turn your Logs into Business – Log Data Is The Cloud-Native Currency

Logs
Your Cloud
Application/Solution
Query
Transform
Compress
Aggregate
Repartition
Analyze
Anomaly Detection
User Segmentation
Customer Support
Resource Planning
• Build & run data pipelines and analytics of your log message data
• Flexible log data analytics with full power of SQL
• Seamless scalability & elasticity according to your log message volume
Use for analyzing application logs

IDUG Db2 Tech Conference
Charlotte, NC | June 2 – 6, 2019
Data Lake in IBM Cloud – How it works
IBM Cloud Data Lake
Data
Streaming
Upload
ETL
DB2
Feature
Extraction
Data
Prep
ICD
DB2
ICD
OLAP
Analytics WML
ETL
Federate
Aspera
Cloudant
Replication
Secure
Sync
IBM
Blockchain
Applications
Applications
Watson
Studio
Knowledge
Catalog
METASTORE
AI
ICP for Data
Analytics
Engine
IBM Cloud
Functions
Land Process Integrate
Key Protect
Index
Creation

Getting started: https://www.ibm.com/cloud/sql-query
SQL Query Intro Video: https://youtu.be/s-FznfHJpoU
SQL Query Starter Notebook in Watson Studio: https://ibm.biz/BdYNrN
SQL Reference: https://ibm.biz/Bd2jF7
SQL Query API doc: https://cloud.ibm.com/apidocs/sql-query
Big Data Layout Best Practices for COS: https://ibm.biz/Bd2jRg
Serverless Data & Analytics: https://ibm.biz/Bd2jF5
Further Resources

1. Identify friction points in users’ digital journey, e.g.:
• Clicks-2-purchase ratio
• Unexpected repeated page visits per user
• E.g. entering payment data should only happen once
• Last page visited per session
2. Identify click sequences for successful purchase
• Sequence matching using timeseries analysis
3. Identify customers/segments likely to churn or expand
• Look for typical page visits, actions or flows
• E.g. Terms & conditions, invite additional users etc.
4. Determining your most important content online
What Insights can I extract from a Clickstream?
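A few of the clickstream metrics above can be sketched in plain Python. Everything here is an illustrative assumption: the event layout (session, user, timestamp, page), the page names, and the exact metric definitions are not taken from the slides.

```python
from collections import Counter

# Hypothetical click events: (session_id, user, timestamp, page).
events = [
    ("s1", "alice", 1, "home"),
    ("s1", "alice", 2, "cart"),
    ("s1", "alice", 3, "payment"),
    ("s1", "alice", 4, "payment"),
    ("s1", "alice", 5, "confirmation"),
    ("s2", "bob", 1, "home"),
    ("s2", "bob", 2, "terms"),
]


def clicks_per_purchase(events, purchase_page):
    """Total clicks divided by the number of purchase events (None if no purchase)."""
    purchases = sum(1 for _, _, _, page in events if page == purchase_page)
    return len(events) / purchases if purchases else None


def repeated_visits(events, page):
    """Users who visited `page` more than once, with their visit counts."""
    counts = Counter(user for _, user, _, p in events if p == page)
    return {user: n for user, n in counts.items() if n > 1}


def last_page_per_session(events):
    """The page with the highest timestamp in each session."""
    last = {}
    for session, _, _, page in sorted(events, key=lambda e: (e[0], e[2])):
        last[session] = page
    return last


ratio = clicks_per_purchase(events, purchase_page="confirmation")
repeats = repeated_visits(events, page="payment")
exits = last_page_per_session(events)
```

In this sample, the repeated payment page visit flags a possible friction point, and the session ending on the terms page is the kind of signal the churn bullet refers to.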

Building IBM Cloud-Native Data Lake
Serverless SQL
Serverless Storage
Serverless Pipeline
Automation ✓
✓
✓
Orchestration
Processing
Persistency Data Ingest
✓
Data Catalog ✓
Serverless
Unstructured Data
Processing ✓

• Traditional analytics systems
• Fixed capacities of appliances
• Specialized teams of data engineers & DBAs who manage data model, access and ETL
• BI analysts who have access only to the curated data sets in EDW
• Innovative enterprises today
• Wide range of teams that require direct access to same data set at all stages of the data
pipeline: BI analysts, data scientists, quantitative marketers, dev/ops, developers
• Data engineers that support these teams need a much more scalable and cost-effective
platform to ensure all teams have the access they need, when they need it
• Building analytics platforms in the cloud because of the scale and cost-efficiencies that
come with serverless analytics over object stores
Serverless – The key to IT Sharing Economy ... also for Analytics

Proper data organization →
better performance and lower cost
29, 2019 / © 2019 IBM Corporation
The key factors are:
• Number of bytes shipped
• Number of REST requests
Best practices for structured data:
• Choose the right object size (sweet spot: 128 MB)
• Choose the right format
• Choose the right data layout
• Avoid gzip compressed formats
Applies to SQL Query but also
applies to other Big Data engines
To learn more: https://www.ibm.com/blogs/bluemix/2018/06/big-data-layout/

Which Format is Query-Friendly?

2. Use Hive style partitioning
GPMeterStream/dt=2017-08-17/part-00085.csv
Avoid reading unnecessary objects altogether
Technique has limitations
Best Practice: minimize bytes scanned
1. Use Parquet
• Column based
• Only read the columns you need
• Column wise compression
• Min/max metadata
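The effect of Hive-style partitioning can be sketched as follows: partition values are read from the object key itself, so objects outside the requested partition are never opened. The helper names are hypothetical, and only the second object key comes from the slide; the other two are made up for illustration.

```python
def partition_values(object_key):
    """Extract Hive-style name=value pairs from the folder segments of an object key."""
    values = {}
    for segment in object_key.split("/")[:-1]:  # skip the object name itself
        if "=" in segment:
            name, _, value = segment.partition("=")
            values[name] = value
    return values


def prune(object_keys, column, wanted):
    """Keep only the objects whose partition value for `column` equals `wanted`."""
    return [key for key in object_keys if partition_values(key).get(column) == wanted]


object_keys = [
    "GPMeterStream/dt=2017-08-16/part-00001.csv",  # hypothetical
    "GPMeterStream/dt=2017-08-17/part-00085.csv",  # from the slide
    "GPMeterStream/dt=2017-08-18/part-00002.csv",  # hypothetical
]

selected = prune(object_keys, column="dt", wanted="2017-08-17")
```

One limitation is visible in the sketch: pruning only helps for predicates on the partition column, so a filter on any other column still has to read every object.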

Table Locators
cos://<endpoint>/<bucket>/[<prefix>] <format definition>
Endpoint – of your object storage bucket or a short alias
E.g. s3.us-south.objectstorage.appdomain.cloud or alias us-south
Bucket – name in object storage
Prefix – one or multiple objects (i.e. table partitions) with same prefix
Used in FROM clauses for input data and in target field for result set data
Examples:
cos://us-south/myBucket/myFolder/mySubFolder/myData.parquet
cos://us/otherBucket/myData
cos://us/otherBucket/myData/part
cos://eu/newBucket/
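For readers who want to handle table locators programmatically, here is a small parsing sketch based only on the `cos://<endpoint>/<bucket>/[<prefix>]` pattern above. The `TableLocator` type and `parse_table_locator` function are hypothetical helpers, not part of any IBM SDK.

```python
from typing import NamedTuple


class TableLocator(NamedTuple):
    endpoint: str  # full endpoint host or a short alias such as "us-south"
    bucket: str
    prefix: str    # may be empty, meaning "all objects in the bucket"


def parse_table_locator(url):
    """Split a cos:// table locator into endpoint, bucket and optional prefix."""
    scheme = "cos://"
    if not url.startswith(scheme):
        raise ValueError(f"not a COS table locator: {url!r}")
    parts = url[len(scheme):].split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"endpoint and bucket are required: {url!r}")
    prefix = parts[2] if len(parts) == 3 else ""
    return TableLocator(parts[0], parts[1], prefix)


full = parse_table_locator("cos://us-south/myBucket/myFolder/mySubFolder/myData.parquet")
whole_bucket = parse_table_locator("cos://eu/newBucket/")
```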

<Table Locator> [JOBPREFIX JOBID | NONE]
[STORED AS CSV | PARQUET | JSON]
• Specifies the data format of the input data
• Table schema is automatically inferred at SQL execution time
• STORED AS Clause is optional, the default is CSV
• Additional parameters for CSV:
• E.g.: FIELDS TERMINATED BY '\t' NOHEADER
• JOBPREFIX only for targets: defines unique prefix to append. Default is JOBID.
Table Format Definition

SELECT … INTO
<Table Locator> [STORED AS CSV | PARQUET | JSON]
[PARTITIONED [BY (<column list>)]
[INTO <num> BUCKETS]
[EVERY <num> ROWS]]
[SORT BY (<column list>)]
BY: Produces Hive Style Partitioning
INTO: Produces a fixed number of partitions (hash partitioned)
EVERY: Produces partitions of even size (e.g. for pagination)
SORT BY: Exact result order & clustering when combined with PARTITIONED
Table Partitioning Definition
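The three partitioning modes can be pictured with a small in-memory sketch. It only mimics the layout each mode produces and is not how SQL Query writes result objects; in particular, hashing the whole row with CRC32 for the INTO case is an assumption made for illustration, as are the function names and sample rows.

```python
import zlib
from collections import defaultdict


def partition_by(rows, column):
    """BY: one Hive-style partition (column=value) per distinct value."""
    groups = defaultdict(list)
    for row in rows:
        groups[f"{column}={row[column]}"].append(row)
    return dict(groups)


def partition_into_buckets(rows, num_buckets):
    """INTO n BUCKETS: a fixed number of partitions, assigned by a hash of the row."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        digest = zlib.crc32(repr(sorted(row.items())).encode("utf-8"))
        buckets[digest % num_buckets].append(row)
    return buckets


def partition_every(rows, num_rows):
    """EVERY n ROWS: consecutive partitions of at most n rows, e.g. pages of a result."""
    return [rows[i:i + num_rows] for i in range(0, len(rows), num_rows)]


# Hypothetical result rows.
rows = [
    {"city": "Berlin", "temp": 21.0},
    {"city": "Berlin", "temp": 23.5},
    {"city": "Austin", "temp": 41.2},
    {"city": "Austin", "temp": 39.8},
    {"city": "Oslo", "temp": 12.1},
]

by_city = partition_by(rows, "city")
buckets = partition_into_buckets(rows, num_buckets=3)
pages = partition_every(rows, num_rows=2)
```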

Submit a SQL query
POST https://api.sql-query.cloud.ibm.com/v2/sql_jobs
Runs the SQL in the background and returns a job_id
Detailed info for a SQL query (e.g. status, result location)
GET https://api.sql-query.cloud.ibm.com/v2/sql_jobs/{job_id}
Returns JSON with query execution details
List of recent SQL query executions
GET https://api.sql-query.cloud.ibm.com/v2/sql_jobs
Returns JSON array with last 30 SQL submissions and outcomes
IBM SQL Query REST API
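The three calls above could be prepared from Python roughly as follows. This is a sketch under stated assumptions rather than a reference client: the `instance_crn` query parameter, the JSON body fields `statement` and `resultset_target`, and the bearer-token header are assumptions to verify against the API documentation. The functions only build requests; nothing is sent until a request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.sql-query.cloud.ibm.com/v2/sql_jobs"


def _build(method, url, instance_crn, token, body=None):
    """Build (but do not send) an authenticated request."""
    query = urllib.parse.urlencode({"instance_crn": instance_crn})  # assumed parameter name
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(f"{url}?{query}", data=data, headers=headers, method=method)


def build_submit_request(statement, target, instance_crn, token):
    """POST /v2/sql_jobs: submit a SQL statement to run in the background."""
    body = {"statement": statement, "resultset_target": target}  # assumed field names
    return _build("POST", API_BASE, instance_crn, token, body)


def build_status_request(job_id, instance_crn, token):
    """GET /v2/sql_jobs/{job_id}: fetch execution details for one job."""
    return _build("GET", f"{API_BASE}/{urllib.parse.quote(job_id)}", instance_crn, token)


def build_list_request(instance_crn, token):
    """GET /v2/sql_jobs: list recent submissions."""
    return _build("GET", API_BASE, instance_crn, token)
```

A typical flow would send the submit request, read the job_id from the JSON response, and then poll the status request until the job has finished.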

Scaling Analytics: Data Skipping Saving you Time and $
Index All
Objects
Data Set Objects
SQL
Query
Data Skipping
Indexing
Candidate
Objects
WHERE Clause
Saving Time
and $
SQL Query learns which objects are not relevant to a query
using a data skipping index
CREATE METAINDEX stores index summary metadata for
each object. Much smaller than the data.
SQL queries skip irrelevant objects, significantly reducing I/O
Independent of data formats
Index types: Min/Max, Value List, Bounding Box
E.g.: Get location and time of heat waves (> 40 °C)
SELECT lat, long, city, temp, date
FROM weather
WHERE temp > 40.0
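How a min/max index lets a query skip objects can be sketched in a few lines. This is a simplified in-memory model of the idea, not the CREATE METAINDEX implementation; the object names, rows and function names are hypothetical.

```python
def build_min_max_index(objects, column):
    """Store only (min, max) of `column` per object: far smaller than the data itself."""
    index = {}
    for name, rows in objects.items():
        values = [row[column] for row in rows]  # assumes every object has at least one row
        index[name] = (min(values), max(values))
    return index


def candidates_greater_than(index, threshold):
    """Objects that may contain a value > threshold; all others can be skipped unread."""
    return [name for name, (_, high) in index.items() if high > threshold]


# Hypothetical weather objects in a bucket.
objects = {
    "weather/part-0": [{"city": "Oslo", "temp": 12.5}, {"city": "Rome", "temp": 31.0}],
    "weather/part-1": [{"city": "Cairo", "temp": 38.0}, {"city": "Phoenix", "temp": 41.5}],
    "weather/part-2": [{"city": "Nuuk", "temp": -3.0}, {"city": "Reykjavik", "temp": 9.0}],
}

index = build_min_max_index(objects, "temp")
to_scan = candidates_greater_than(index, 40.0)

# Only the candidate objects are read and filtered row by row.
heat_waves = [row for name in to_scan for row in objects[name] if row["temp"] > 40.0]
```

The index can only rule objects out, never in: a candidate object may still contain no matching row, so the WHERE clause is always applied to the rows that are read.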

• JDBC compliant driver library that wraps REST API
• Wraps both the SQL Query and COS REST APIs
• Exposing regular session interface (JDBC Connection)
• Enabling custom JDBC application support
• Enabling BI application support
• Early adopter: Looker
• Support for stored table metadata (simple catalog)
• Stored as JSON in COS and referenced via the JDBC
connection string
• I.e. DatabaseMetaData interface also supported
JDBC Driver for BI Applications
Apply for Beta Now
Query
JDBC Driver
REST
COS
JDBC
API
Data
Result
Sets
Table
Catalog
E.g. Looker

Using SQL Query JDBC Driver
Define table catalog
• JSON file in COS containing:
• Table name
• Location of table objects on COS
• Object format
• Column names
• Column types
• INT, FLOAT, VARCHAR, TIMESTAMP
JDBC Connection String:
jdbc:SQLQuery:<sql-query instance crn>
?schemabucket=<COS bucket with json catalog>
?schemafile=<COS object with json catalog>
&apikey=<api key for your account>
&targetcosurl=<COS URL for result set>

IBM Cloud Functions
Fair: never pay for idle
Polyglot
Elastic
Automation
Triggers
Open Source
CLOUD
FUNCTIONS
Schedules
Sequences
