Serverless SQL
Torsten Steinbach
@torsstei
IBM
SQL on Object
Storage
DM Gartner
Hype Cycle
2018
Evolution of Form Factors
For Big Data Analytics
Enterprise Data
Warehouses
Tightly integrated and
optimized systems
Hadoop
Introduced open data formats & easy
scaling on commodity HW
Cloud-Native:
Serverless Analytics-aaS
• Seamless elasticity
• Pay-per-query consumption
• Analyze data as it sits in an object store
• Disaggregated architecture
• No more infrastructure headaches
The 1990s | The 2000s | Today
Ingredient 3: Serverless Data Transformation
Ingredient 4: Serverless Analytics
Ingredient 5: Serverless Automation
Ingredient 2: Serverless Data Ingest
Sharing Economy for Analytics
Ingredient 1: Serverless Storage
Object Storage
IBM Cloud Object Storage
Objects
Objects
Objects
At Rest
On the Wire
Buckets
Encrypted
Pennies per GB
REST
Elastic
Durable
Flexible
Resiliency Choices
Storage Classes
User Managed
Encryption Keys
S3 Compatible
High Speed Data
Transfer
Aspera
SQL Queries
Data Ingest Options
High Customizability
Degree of Serverless-ness
IBM Event Streams
(Kafka aaS)
IBM Cloud Functions
Out-of-the-Box
IBM Streaming Analytics
(IBM Streams aaS)
via Cloud Object Storage API
SQL Query ETL
Cloudant Replication
Blockchain Synch
Cloud Data
Data
Transformation
Serverless SQL
Analytics
IBM SQL Query
Object
Storage
Db2
+
Developers
Data
Engineers
Data Analysts
✓ Perfect for Machine Generated Data
✓ Ad-hoc Data Exploration
✓ Operationalizing Data Pipelines
✓ Big Data Lakes
✓ Flexible Data Transformation
✓ Extremely affordable. $5/TB scanned
✓ 100% API enabled
✓ Analytics on Object Storage
✓ Big Data Scale-Out. Running on Spark
✓ 100% Self service – No Setup
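A rough sketch of what pay-per-query pricing means in practice, using only the $5/TB scanned figure from the list above. The function name, the decimal definition of a terabyte, and the example data volumes are assumptions made for illustration; actual billing rules may differ.

```python
BYTES_PER_TB = 10 ** 12  # assumption: decimal terabyte


def estimate_query_cost(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Return the estimated charge in USD for a query that scans bytes_scanned."""
    return bytes_scanned / BYTES_PER_TB * usd_per_tb


# Hypothetical comparison: the same query over a raw and a compacted data set.
raw_scan_bytes = 400 * 10 ** 9        # 400 GB scanned
compacted_scan_bytes = 40 * 10 ** 9   # 40 GB scanned

print(f"raw:       ${estimate_query_cost(raw_scan_bytes):.2f}")
print(f"compacted: ${estimate_query_cost(compacted_scan_bytes):.2f}")
```

Because the charge follows bytes scanned, anything that lets a query read less data lowers the bill directly.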
2. Read data
4. Read
results
Application
3. Write results
IBM Cloud
Object Storage
Result Set
Data Set
Data Set
Data Set
1. Submit SQL
SQL
Archive / Export
IBM Cloud Streaming
IBM Streams
Event Streams
Land
Query
IBM Cloud Functions
IBM SQL Query
Architecture
IBM Cloud Databases
Db2 on Cloud
Geospatial SQL
Data Skipping
Timeseries SQL
Upload
Data Center 2
Analytics Engine Cluster
20 Kernels
Node 1
Node 3
Node 2
Node 3
…
20
Kernels
…
Data Center 3
Analytics Engine Cluster
20 Kernels
Node 1
Node 3
Node 2
Node 3
…
20
Kernels
…
SQL 1 SQL 1
Data Center 1
IBM Cloud SQL Query – Very High Level Architecture (MVP 1Q 2018)
Analytics Engine Cluster
20 Kernels
Cluster
Pool
Request Queue
Node 1
Node 3
Node 2
Node 3
…
Kernel
Pools
20
Kernels
…
SQL 1 SQL 2 SQL 3 SQL 4 SQL 5
Cloud Object Storage
SQL 6 …
JKG (Web Sockets)
IBM Cloud Query – Spark Cluster Architecture
SQL REST API
Create
Query
SQL Web Console
Watson
Studio
Notebooks
SQL Cloud Function
Integrate Explore
Deploy
IBM Cloud Query – Access Patterns
Node SDK
Python SDK
JDBC
Looker
Best of breed Spark SQL Reference
• Complete, intuitive and interactive SQL Reference
• Each sample SQL can immediately be executed as is
https://cloud.ibm.com/docs/services/sql-query/sqlref/sql_reference.html#sql-reference
Analytics using full Power of Spark SQL
IBM SQL Query – Timeseries SQL 1/2
§ Intuitive first-of-a-kind SQL extensions for timeseries operations
§ Industry leading differentiators, including:
• Timeseries transformation functions:
• Correlation, Fourier transformation,
z-normalization, Granger, interpolation,
and distances
• Temporal Joins: SQL support for
Left/Right/Full Inner and Outer joins
of multiple timeseries
Alignment & Joining:
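To make the idea of a temporal join concrete, here is a minimal plain-Python sketch that aligns two timeseries on their timestamps. It is not the Timeseries SQL syntax of SQL Query. The `Series` type, the function name and the sample readings are hypothetical, and the sketch only matches exact timestamps.

```python
from typing import Dict, List, Optional, Tuple

Series = Dict[int, float]  # timestamp (epoch seconds) -> observed value
Row = Tuple[int, Optional[float], Optional[float]]


def temporal_join(left: Series, right: Series, how: str = "inner") -> List[Row]:
    """Align two timeseries on identical timestamps, ordered by time."""
    if how == "inner":
        keys = left.keys() & right.keys()
    elif how == "left":
        keys = set(left)
    elif how == "right":
        keys = set(right)
    elif how == "full":
        keys = left.keys() | right.keys()
    else:
        raise ValueError(f"unsupported join type: {how}")
    # A missing observation on either side is reported as None.
    return [(t, left.get(t), right.get(t)) for t in sorted(keys)]


temperature = {0: 20.0, 60: 21.0, 120: 22.5}
humidity = {60: 40.0, 120: 41.0, 180: 39.0}

for row in temporal_join(temperature, humidity, how="full"):
    print(row)
```

An inner join keeps only timestamps present in both series, while the outer variants keep unmatched timestamps and leave the missing side empty.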
§ Further Industry leading differentiators
• Numerical and categorical timeseries types
• Timeseries data skipping for fast queries
• Forecasting:
• ARIMA, BATS, Anomaly detection, etc.
• Subsequence Mining:
• Train & match models for event sequences
• Segmentation:
• Time-based, Record-based, Anchor-based, Burst, and silence
Segmentation:
IBM SQL Query – Timeseries SQL 2/2
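As an illustration of two of the segmentation styles named on this slide, the following plain-Python sketch splits a list of observations by fixed time windows and by record count. The function names and sample data are hypothetical, and this is not how SQL Query implements segmentation.

```python
from typing import Dict, List, Tuple

Observation = Tuple[int, float]  # (timestamp in seconds, value)


def segment_by_time(observations: List[Observation],
                    window_seconds: int) -> List[List[Observation]]:
    """Time-based segmentation: one segment per fixed-length time window."""
    windows: Dict[int, List[Observation]] = {}
    for ts, value in sorted(observations):
        windows.setdefault(ts // window_seconds, []).append((ts, value))
    return [windows[key] for key in sorted(windows)]


def segment_by_records(observations: List[Observation],
                       size: int) -> List[List[Observation]]:
    """Record-based segmentation: consecutive segments of at most size records."""
    ordered = sorted(observations)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


readings = [(0, 1.0), (30, 2.0), (65, 3.0), (130, 4.0)]
print(segment_by_time(readings, window_seconds=60))
print(segment_by_records(readings, size=3))
```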
IBM SQL Query – Spatial SQL
§ SQL/MM standard to store & analyze spatial data in RDBMS
§ Migration of PostGIS compliant SQL queries
§ Aggregation, computation and join via native SQL syntax
§ Industry leading differentiators
• Geodetic Full Earth support
• Increased developer productivity
• Avoid piece-wise planar projections
• High precision calculations anywhere on the earth
• Very large polygons (e.g. countries), polar caps, crossing the anti-meridian
• Spatial data skipping for fast queries
• Native and fine-granular geohash support
• Fast spatial aggregation
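The geohash bullets above can be pictured with a short sketch of the standard public geohash encoding, followed by a simple aggregation that counts points per geohash cell. This is plain Python rather than the spatial SQL functions of SQL Query, and the function names and sample points are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 7) -> str:
    """Encode a latitude/longitude pair as a geohash of precision characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between longitude and latitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits form one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def count_by_cell(points: Iterable[Tuple[float, float]], precision: int) -> Counter:
    """Spatial aggregation: number of points falling into each geohash cell."""
    return Counter(geohash_encode(lat, lon, precision) for lat, lon in points)


sample_points = [(57.64911, 10.40744), (57.64912, 10.40745), (0.0, 0.0)]
print(count_by_cell(sample_points, precision=5))
```

Shorter geohashes describe larger cells, so the chosen precision sets the granularity of the aggregation.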
Example: Spatio-Temporal Processing of Sensor Data
IBM Cloud Object Storage
Sensor
Data
Query
Location
Analytics
Mobile
Cars
Devices
Land
Location
Filtering
Spatial
Aggregation
GPS
SQL/MM
Sensor
Metrics
t
t
t
Timeseries
Assembly
Timeseries
Join
Timeseries SQL
t
Serverless
Storage
Serverless
Runtimes
Serverless
Analytics
Object
Storage
Cloud
Functions
Query
A Completely Serverless Stack for Data & Analytics Solutions
Unstructured Data Prep
SQL Query
Cloud
Functions
Analyze
COS
COS
Extract Features
Automated/Scheduled SQL Execution
SQL Query
Cloud
Functions
Develop SQL Deploy as SQL Cloud Function
Set up Cloud
Function
Trigger/Schedule
Shield Data From Direct Access
SQL Query
Cloud
Functions
Deploy Cloud Function
with COS API Key
User Calls
Function to
Access Data
COS
Grant Execute on SQL
Cloud Function to User
Configure SQL Pipelines
SQL Query
Cloud
Functions
User creates function
sequence to automate flow
of consecutive SQLs
Sequence
SQL Query
Cloud
Functions
1.
2.
Use Cases of Cloud Functions Adding Value to SQL
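The "Configure SQL Pipelines" use case can be sketched as a chain of functions in which the result of one step becomes the input of the next. The sketch below simulates that locally in plain Python. The step functions only record the SQL statement they would submit, and every name, bucket and column in it is hypothetical.

```python
from typing import Callable, Dict, List

Action = Callable[[Dict], Dict]


def run_sequence(actions: List[Action], params: Dict) -> Dict:
    """Call the actions in order, feeding each result into the next action."""
    result = params
    for action in actions:
        result = action(result)
    return result


def cleanse_step(params: Dict) -> Dict:
    # Stand-in for a SQL function: records the statement instead of running it.
    statement = (f"SELECT * FROM {params['input']} STORED AS JSON "
                 f"INTO {params['staging']} STORED AS PARQUET")
    return {"input": params["staging"], "target": params["target"],
            "submitted": params.get("submitted", []) + [statement]}


def aggregate_step(params: Dict) -> Dict:
    statement = (f"SELECT status, COUNT(*) AS hits FROM {params['input']} "
                 f"STORED AS PARQUET GROUP BY status "
                 f"INTO {params['target']} STORED AS CSV")
    return {"target": params["target"],
            "submitted": params["submitted"] + [statement]}


outcome = run_sequence(
    [cleanse_step, aggregate_step],
    {"input": "cos://us/logs/raw/",
     "staging": "cos://us/logs/clean/",
     "target": "cos://us/logs/report/"},
)
for sql in outcome["submitted"]:
    print(sql)
```

The second statement reads from the location the first one wrote to, which is the essence of a sequence of consecutive SQLs.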
Ingredient 3: Serverless Data Transformation ✓
Ingredient 4: Serverless Analytics ✓
Ingredient 5: Serverless Automation ✓
Ingredient 2: Serverless Data Ingest ✓
Ingredient 1: Serverless Storage ✓
Now, what is this all good for?
IBM Cloud Object Storage
Acquire
Query
Data Warehouses &
Databases
Db2 on Cloud
Process Analyze
Applications
Applications
Applications
IoT
Streaming
Devices
Devices
Devices
BI & AI
Land
Log Messages
Cleanse
Filter
Merge
Aggregate
Compress
Watson Studio
Looker
Cognos
WML
Explore
Analyze Analyze
Promote
Use for Data Pipelines to fuel BI & AI
Data-Driven Decisions
☛ Understanding system health, user behavior & workload status
Collecting & Analyzing Log Data
☛ Is NOT an afterthought but rather the foundation for decisions on
system and feature design.
Data Volume Growing Rapidly
☛ Growth rates and data volume at rest can jump dramatically. Very
high elasticity is required.
Competitive Advantage
☛ Is based on short runways for turning data into actions
Turn your Logs into Business – Log Data Is The Cloud-Native Currency
Logs
Your Cloud
Application/Solution
IBM Cloud Object Storage
Query
Transform
Compress
Aggregate
Repartition
Analyze
Anomaly Detection
User Segmentation
Customer Support
Resource Planning
• Build & run data pipelines and analytics of your log message data
• Flexible log data analytics with full power of SQL
• Seamless scalability & elasticity according to your log message volume
Use for analyzing application logs
IDUG Db2 Tech Conference
Charlotte, NC | June 2 – 6, 2019
Data Lake in IBM Cloud – How it works
IBM Cloud Data Lake
Data
Streaming
Upload
ETL
DB2
Feature
Extraction
Data
Prep
ICD
DB2
ICD
OLAP
Analytics WML
ETL
Federate
Aspera
Cloudant
Replication
Secure
Sync
IBM
Blockchain
Applications
Applications
Watson
Studio
Knowledge
Catalog
METASTORE
AI
ICP for Data
Analytics
Engine
IBM Cloud
Functions
Land Process Integrate
Key Protect
Index
Creation
Getting started: https://www.ibm.com/cloud/sql-query
SQL Query Intro Video: https://youtu.be/s-FznfHJpoU
SQL Query Starter Notebook in Watson Studio: https://ibm.biz/BdYNrN
SQL Reference: https://ibm.biz/Bd2jF7
SQL Query API doc: https://cloud.ibm.com/apidocs/sql-query
Big Data Layout Best Practices for COS: https://ibm.biz/Bd2jRg
Serverless Data & Analytics: https://ibm.biz/Bd2jF5
Further Resources
Backup
1. Identify friction points in users’ digital journey, e.g.:
• Clicks-2-purchase ratio
• Unexpected repeated page visits per user
• E.g. entering payment data should only happen once
• Last page visited per session
2. Identify click sequences for successful purchase
• Sequence matching using timeseries analysis
3. Identify customers/segments likely to churn or expand
• Look for typical page visits, actions or flows
• E.g. Terms & conditions, invite additional users etc.
4. Determining your most important content online
What Insights can I extract from a Clickstream?
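A few of the clickstream insights listed on this slide can be sketched in plain Python over a tiny in-memory sample. The `Click` record layout, the page names and the purchase page are hypothetical. In the setting of this talk the same aggregations would be written as SQL over click data in object storage.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Tuple


class Click(NamedTuple):
    user: str
    session: str
    ts: int
    page: str


def clicks_per_purchase(clicks: List[Click], purchase_page: str) -> float:
    """Clicks-2-purchase ratio: total clicks divided by completed purchases."""
    purchases = sum(1 for c in clicks if c.page == purchase_page)
    return len(clicks) / purchases if purchases else float("inf")


def repeated_visits(clicks: List[Click]) -> Dict[Tuple[str, str], int]:
    """Pages a user visited more than once, e.g. re-entering payment data."""
    counts = Counter((c.user, c.page) for c in clicks)
    return {key: n for key, n in counts.items() if n > 1}


def last_page_per_session(clicks: List[Click]) -> Dict[str, str]:
    """Last page visited in each session, a hint at where users drop off."""
    last: Dict[str, str] = {}
    for c in sorted(clicks, key=lambda c: c.ts):
        last[c.session] = c.page
    return last


CLICKS = [
    Click("u1", "s1", 1, "/home"),
    Click("u1", "s1", 2, "/payment"),
    Click("u1", "s1", 3, "/payment"),
    Click("u1", "s1", 4, "/checkout/done"),
    Click("u2", "s2", 1, "/home"),
    Click("u2", "s2", 2, "/terms"),
]

print(clicks_per_purchase(CLICKS, "/checkout/done"))
print(repeated_visits(CLICKS))
print(last_page_per_session(CLICKS))
```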
Building IBM Cloud-Native Data Lake
Serverless SQL
Serverless Storage
Serverless Pipeline
Automation ✓
✓
✓
Orchestration
Processing
Persistency Data Ingest
✓
Data Catalog ✓
Serverless
Unstructured Data
Processing ✓
• Traditional analytics systems
• Fixed capacities of appliances
• Specialized teams of data engineers & DBAs who manage data model, access and ETL
• BI analysts who have access only to the curated data sets in EDW
• Innovative enterprises today
• Wide range of teams that require direct access to same data set at all stages of the data
pipeline: BI analysts, data scientists, quantitative marketers, dev/ops, developers
• Data engineers that support these teams need a much, much more scalable and cost-
effective platform to ensure all teams have access they need and when needed
• Building analytics platforms in the cloud because of the scale and cost-efficiencies that
come with serverless analytics over object stores
Serverless – The key to IT Sharing Economy ... also for Analytics
Proper data organization →
better performance and lower cost
The key factors are:
• Number of bytes shipped
• Number of REST requests
Best practices for structured data:
• Choose the right object size (sweet spot: 128 MB)
• Choose the right format
• Choose the right data layout
• Avoid gzip compressed formats
Applies to SQL Query but also
applies to other Big Data engines
To learn more: https://www.ibm.com/blogs/bluemix/2018/06/big-data-layout/
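Since the number of REST requests is one of the two key factors, a quick calculation shows why object size matters. The sketch assumes that every object costs at least one read request and uses the 128 MB sweet spot from this slide, interpreted here as 128 MiB. The function name and the example data set size are hypothetical.

```python
import math

TARGET_OBJECT_BYTES = 128 * 1024 * 1024  # sweet spot from the slide, taken as 128 MiB


def objects_needed(total_bytes: int,
                   object_bytes: int = TARGET_OBJECT_BYTES) -> int:
    """Number of objects (and minimum read requests) for a data set."""
    return max(1, math.ceil(total_bytes / object_bytes))


data_set_bytes = 64 * 1024 ** 3  # hypothetical 64 GiB data set

print("128 MiB objects:", objects_needed(data_set_bytes))
print("  1 MiB objects:", objects_needed(data_set_bytes, 1024 ** 2))
```

The same data stored as many tiny objects multiplies the request count without reducing the bytes shipped.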
Which Format is Query-Friendly?
Best Practice: minimize bytes scanned
1. Use Parquet
• Column based
• Only read the columns you need
• Column wise compression
• Min/max metadata
2. Use Hive style partitioning
GPMeterStream/dt=2017-08-17/part-00085.csv
GPMeterStream/dt=2017-08-17/part-00086.csv
GPMeterStream/dt=2017-08-17/part-00087.csv
GPMeterStream/dt=2017-08-17/part-00088.csv
GPMeterStream/dt=2017-08-17/part-00089.csv
GPMeterStream/dt=2017-08-18/part-00001.csv
GPMeterStream/dt=2017-08-18/part-00002.csv
GPMeterStream/dt=2017-08-18/part-00003.csv
Avoid reading unnecessary objects altogether
Technique has limitations
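Hive style partitioning, as in the GPMeterStream object names on this slide, encodes column values in the object name so that a query with a matching filter can skip whole objects. A minimal plain-Python sketch of that pruning step follows. The function names are hypothetical and a real engine does this inside its query planner.

```python
from typing import Dict, List


def partition_values(object_key: str) -> Dict[str, str]:
    """Extract name=value partition segments from an object key."""
    values: Dict[str, str] = {}
    for segment in object_key.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            name, value = segment.split("=", 1)
            values[name] = value
    return values


def prune(object_keys: List[str], column: str, wanted: str) -> List[str]:
    """Keep only the objects whose partition value matches the filter."""
    return [key for key in object_keys
            if partition_values(key).get(column) == wanted]


KEYS = [
    "GPMeterStream/dt=2017-08-17/part-00085.csv",
    "GPMeterStream/dt=2017-08-17/part-00086.csv",
    "GPMeterStream/dt=2017-08-18/part-00001.csv",
    "GPMeterStream/dt=2017-08-18/part-00002.csv",
]

# Equivalent in spirit to a filter such as: WHERE dt = '2017-08-18'
print(prune(KEYS, "dt", "2017-08-18"))
```

The limitation is visible in the sketch: pruning only helps for filters on the partition columns, here `dt`.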
Table Locators
cos://<endpoint>/<bucket>/[<prefix>] <format definition>
Endpoint – of your object storage bucket or a short alias
E.g. s3.us-south.objectstorage.appdomain.cloud or alias us-south
Bucket – name in object storage
Prefix – one or multiple objects (i.e. table partitions) with same prefix
Used in FROM clauses for input data and in target field for result set data
Examples:
cos://us-south/myBucket/myFolder/mySubFolder/myData.parquet
cos://us/otherBucket/myData
cos://us/otherBucket/myData/part
cos://eu/newBucket/
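The locator layout above can be checked with a small parser. This is a sketch that follows only the `cos://<endpoint>/<bucket>/[<prefix>]` pattern shown on this slide. The class and function names are hypothetical and not part of any SDK.

```python
import re
from typing import NamedTuple


class TableLocator(NamedTuple):
    endpoint: str
    bucket: str
    prefix: str  # empty string when the locator addresses the whole bucket


_LOCATOR = re.compile(
    r"^cos://(?P<endpoint>[^/]+)/(?P<bucket>[^/]+)/(?P<prefix>.*)$"
)


def parse_locator(locator: str) -> TableLocator:
    """Split a table locator into endpoint, bucket and prefix."""
    match = _LOCATOR.match(locator)
    if match is None:
        raise ValueError(f"not a valid table locator: {locator!r}")
    return TableLocator(match["endpoint"], match["bucket"], match["prefix"])


for example in (
    "cos://us-south/myBucket/myFolder/mySubFolder/myData.parquet",
    "cos://us/otherBucket/myData",
    "cos://eu/newBucket/",
):
    print(parse_locator(example))
```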
<Table Locator> [JOBPREFIX JOBID | NONE]
[STORED AS CSV | PARQUET | JSON]
• Specifies the data format of the input data
• Table schema is automatically inferred at SQL execution time
• STORED AS Clause is optional, the default is CSV
• Additional parameters for CSV:
• E.g.: FIELDS TERMINATED BY '\t' NOHEADER
• JOBPREFIX only for targets: defines unique prefix to append. Default is JOBID.
Table Format Definition
SELECT … INTO
<Table Locator> [STORED AS CSV | PARQUET | JSON]
[PARTITIONED [BY (<column list>)]
[INTO <num> BUCKETS]
[EVERY <num> ROWS]]
[SORT BY (<column list>)]
BY: Produces Hive Style Partitioning
INTO: Produces a fixed number of partitions (hash partitioned)
EVERY: Produces partitions of even size (e.g. for pagination)
SORT BY: Exact result order & clustering when combined with PARTITIONED
Table Partitioning Definition
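The target clause above can be assembled programmatically, for example when SQL statements are generated by a pipeline. The helper below is a sketch that follows the grammar as written on this slide. Treating BY, INTO and EVERY as mutually exclusive is an assumption, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def into_clause(locator: str,
                fmt: str = "CSV",
                partition_by: Optional[Sequence[str]] = None,
                buckets: Optional[int] = None,
                every_rows: Optional[int] = None,
                sort_by: Optional[Sequence[str]] = None) -> str:
    """Build the INTO part of a SELECT ... INTO statement."""
    if fmt not in ("CSV", "PARQUET", "JSON"):
        raise ValueError(f"unsupported format: {fmt}")
    chosen = [opt for opt in (partition_by, buckets, every_rows) if opt is not None]
    if len(chosen) > 1:
        # Assumption: only one partitioning style per statement.
        raise ValueError("choose only one of partition_by, buckets, every_rows")
    parts = [f"INTO {locator} STORED AS {fmt}"]
    if partition_by:
        parts.append(f"PARTITIONED BY ({', '.join(partition_by)})")
    elif buckets is not None:
        parts.append(f"PARTITIONED INTO {buckets} BUCKETS")
    elif every_rows is not None:
        parts.append(f"PARTITIONED EVERY {every_rows} ROWS")
    if sort_by:
        parts.append(f"SORT BY ({', '.join(sort_by)})")
    return " ".join(parts)


print(into_clause("cos://us/otherBucket/myData", "PARQUET", partition_by=["dt"]))
print(into_clause("cos://us/otherBucket/pages", every_rows=1000, sort_by=["ts"]))
```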
Submit a SQL query
POST https://api.sql-query.cloud.ibm.com/v2/sql_jobs
Runs the SQL in the background and returns a job_id
Detailed info for a SQL query (e.g. status, result location)
GET https://api.sql-query.cloud.ibm.com/v2/sql_jobs/{job_id}
Returns JSON with query execution details
List of recent SQL query executions
GET https://api.sql-query.cloud.ibm.com/v2/sql_jobs
Returns JSON array with last 30 SQL submissions and outcomes
IBM SQL Query REST API
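The submit-then-poll flow of these endpoints can be sketched with the Python standard library. Only the `/v2/sql_jobs` paths, the returned `job_id` and the existence of a job status come from this slide. The bearer token header, the `instance_crn` query parameter, the `statement` body field and the status values `completed` and `failed` are assumptions to verify against the API documentation. The transport is passed in as a function so the flow can be tried without network access.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Dict, Optional

BASE_URL = "https://api.sql-query.cloud.ibm.com/v2"  # host as shown on the slide

Transport = Callable[[str, str, Optional[Dict]], Dict]


def http_transport(bearer_token: str, instance_crn: str) -> Transport:
    """Return a function that sends one JSON request (assumed auth scheme)."""
    query = urllib.parse.urlencode({"instance_crn": instance_crn})

    def send(method: str, path: str, body: Optional[Dict] = None) -> Dict:
        data = json.dumps(body).encode("utf-8") if body is not None else None
        request = urllib.request.Request(
            f"{BASE_URL}{path}?{query}", data=data, method=method,
            headers={"Authorization": f"Bearer {bearer_token}",
                     "Content-Type": "application/json",
                     "Accept": "application/json"})
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    return send


def run_sql(send: Transport, statement: str,
            poll_seconds: float = 2.0, max_polls: int = 150) -> Dict:
    """Submit a statement, then poll the job details until it finishes."""
    job_id = send("POST", "/sql_jobs", {"statement": statement})["job_id"]
    for _ in range(max_polls):
        details = send("GET", f"/sql_jobs/{job_id}", None)
        if details.get("status") in ("completed", "failed"):  # assumed values
            return details
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

A caller would create the transport once, for example `send = http_transport(token, crn)`, and then call `run_sql(send, "SELECT ...")`. The returned job details are where the result location would be looked up.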
Scaling Analytics: Data Skipping Saving you Time and $
Index All
Objects
IBM Cloud Object Storage
Data Set Objects
SQL
Query
Data Skipping
Indexing
Candidate
Objects
WHERE Clause
Saving Time
and $
SQL Query learns which objects are not relevant to a query
using a data skipping index
CREATE METAINDEX stores index summary metadata for
each object. Much smaller than the data.
SQLs skipping irrelevant objects to significantly reduce I/O
E.g.:
Independent of data formats
Index Types: Min/Max, Value List, Bounding Box
Get location and time of heat waves (>40 Celsius)
SELECT lat, long, city, temp, date
FROM weather
WHERE temp > 40.0
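The min/max index type can be illustrated with a toy version of data skipping for the heat wave query above. The index keeps one small summary per object, and the query consults it to pick candidate objects before reading any data. This is a plain-Python sketch, not the CREATE METAINDEX implementation, and the object names and temperature values are hypothetical.

```python
from typing import Dict, List, NamedTuple


class MinMax(NamedTuple):
    low: float
    high: float


def build_min_max_index(objects: Dict[str, List[float]]) -> Dict[str, MinMax]:
    """Store one (min, max) summary of the temp column per object."""
    return {name: MinMax(min(values), max(values))
            for name, values in objects.items() if values}


def candidates_greater_than(index: Dict[str, MinMax], threshold: float) -> List[str]:
    """Objects that may contain rows with temp > threshold; the rest are skipped."""
    return sorted(name for name, summary in index.items()
                  if summary.high > threshold)


weather_objects = {
    "weather/part-0.parquet": [12.0, 25.5, 19.3],
    "weather/part-1.parquet": [31.0, 43.2, 38.7],
    "weather/part-2.parquet": [-3.0, 18.9, 7.4],
}

index = build_min_max_index(weather_objects)
print(candidates_greater_than(index, 40.0))  # only these objects need to be read
```

The index is much smaller than the data, and every skipped object saves both I/O time and scanned bytes.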
• JDBC compliant driver library that wraps REST API
• Wrapping both, SQL Query and COS REST API
• Exposing regular session interface (JDBC Connection)
• Enabling custom JDBC application support
• Enabling BI application support
• Early adopter: Looker
• Support for stored table meta data (simple catalog)
• Stored as json in COS and referenced via JDBC
connection string
• I.e. DatabaseMetaData interface also supported
JDBC Driver for BI Applications
Apply for Beta Now
Query
JDBC Driver
REST
COS
JDBC
API
Data
Result Sets
Table
Catalog
E.g. Looker
Using SQL Query JDBC Driver
Define table catalog
• JSON file in COS containing:
• Table name
• Location of table objects on COS
• Object format
• Column names
• Column types
• INT, FLOAT, VARCHAR, TIMESTAMP
JDBC Connection String:
jdbc:SQLQuery:<sql-query instance crn>
?schemabucket=<COS bucket with json catalog>
?schemafile=<COS object with json catalog>
&apikey=<api key for your account>
&targetcosurl=<COS URL for result set>
Think 2019 / 2263 / February 2019 / © 2019 IBM Corporation
IBM Cloud Functions
Fair Never pay for idle
Polyglot
Elastic
Automation
Triggers
Open Source
CLOUD
FUNCTIONS
Schedules
Sequences

Serverless SQL

  • 2. SQL on Object Storage DM Gartner Hype Cycle 2018
  • 3. Evolution of Form Factors For Big Data Analytics Enterprise Data Warehouses Tightly integrated and optimized systems Hadoop Introduced open data formats & easy scaling on commodity HW Cloud-Native: Serverless Analytics-aaS • Seamless elasticity • Pay-per-query consumption • Analyze data as it sits in an object store • Disaggregated architecture • No more infrastructure head aches The 90-ies 2000 Today
  • 4. Ingredient 3: Serverless Data Transformation Ingredient 4: Serverless Analytics Ingredient 5: Serverless Automation Ingredient 2: Serverless Data Ingest Sharing Economy for Analytics Ingredient 1: Serverless Storage
  • 5. Object Storage IBM Cloud Object Storage Objects Objects Objects At Rest On the Wire Buckets Encrypted Pennies per GB REST Elastic Durable Flexible Resiliency Choices Storage Classes User Managed Encryption Keys S3 Compatible High Speed Data Transfer Aspera SQL Queries
  • 6. Data Ingest Options 6 High Customizability Degree of Serverless-ness IBM Event Streams (Kafka aaS) IBM Cloud Functions Out-of-the-Box IBM Streaming Analytics (IBM Streams aaS) via Cloud Object Storage API SQL Query ETL Cloudant Replication Blockchain Synch
  • 7. Cloud Data Data Transformation Serverless SQL Analytics IBM SQL Query Object Storage Db2 + Developers Data Engineers Data Analysts ü Perfect for Machine Generated Data ü Ad-hoc Data Exploration ü Operationalizing Data Pipelines ü Big Data Lakes ü Flexible Data Transformation ü Extremely affordable. 5$/TB scanned ü 100% API enabled ü Analytics on Object Storage ü Big Data Scale-Out. Running on Spark ü 100% Self service – No Setup
  • 8. 2. Read data 4. Read results Application 3. Write results IBM Cloud Object Storage Result SetData Set Data Set Data Set 1. Submit SQL SQL Archive / Export IBM Cloud Streaming IBM Streams Event Streams Land Query IBM Cloud Functions IBM SQL Query Architecture IBM Cloud Databases Db2 on Cloud Geospatial SQLData Skipping Timeseries SQL Upload
  • 9. Data Center 2 Analytics Engine Cluster 20 Kernels Node 1 Node 3 Node 2 Node 3 … 20 Kernels … Data Center 3 Analytics Engine Cluster 20 Kernels Node 1 Node 3 Node 2 Node 3 … 20 Kernels … SQL 1 SQL 1 Data Center 1 IBM Cloud SQL Query – Very High Level Architecture (MVP 1Q 2018) Analytics Engine Cluster 20 Kernels Cluster Pool Request Queue Node 1 Node 3 Node 2 Node 3 … Kernel Pools 20 Kernels … SQL 1 SQL 2 SQL 3 SQL 4 SQL 5 Cloud Object Storage SQL 6 … JKG (Web Sockets) IBM Cloud Query – Spark Cluster Architecture
  • 10. SQL REST API Create Query SQL Web Console Watson Studio Notebooks SQL Cloud Function Integrate Explore Deploy IBM Cloud Query – Access Patterns Node SDK Python SDK JDBCLooker
  • 11. Best of breed Spark SQL Reference • Complete, intuitive and interactive SQL Reference • Each sample SQL can immediately be executed as is https://cloud.ibm.com/docs/services/sql-query/sqlref/sql_reference.html#sql-reference Analytics using full Power of Spark SQL
  • 12. IBM SQL Query – Timeseries SQL 1/2 § Intuitive first-of-a-kind SQL extensions for timeseries operations § Industry leading differentiators, including: • Timeseries transformation functions: • Correlation, Fourier transformation, z-normalization, Granger, interpolation, and distances • Temporal Joins: SQL support for Left/Right/Full Inner and Outer joins of multiple timeseries Alignment & Joining:
  • 13. § Further Industry leading differentiators • Numerical and categorical timeseries types • Timeseries data skipping for fast queries • Forecasting: • ARIMA, BATS, Anomaly detection, etc. • Subsequence Mining: • Train & match models for event sequences • Segmentation: • Time-based, Record-based, Anchor-based, Burst, and silence Segmentation: IBM SQL Query – Timeseries SQL 2/2
  • 14. • IBM SQL Query – Spatial SQL § SQL/MM standard to store & analyze spatial data in RDBMS § Migration of PostGIS compliant SQL queries § Aggregation, computation and join via native SQL syntax § Industry leading differentiators • Geodetic Full Earth support • Increased developer productivity • Avoid piece-wise planar projections • High precision calculations anywhere on the earth • Very large polygons (e.g. countries), polar caps, x-ing anti-meridian • Spatial data skipping for fast queries • Native and fine-granular geohash support • Fast spatial aggregation
  • 15. Example: Spatio-Temporal Processing of Sensor Data IBM Cloud Object Storage Sensor Data Query Location Analytics Mobile Cars Devices Land Location Filtering Spatial Aggregation GPS SQL/MM Sensor Metrics t t t Timeseries Assembly Timeseries Join Timeseries SQL t
  • 17. Unstructured Data Prep SQL Query Cloud Functions Analyze COSCOS Extract Features Automated/Scheduled SQL Execution SQL Query Cloud Functions Develop SQL Deploy as SQL Cloud Function Set up Cloud Function Trigger/Schedule Shield Data From Direct Access SQL Query Cloud Functions Deploy Cloud Function with COS API Key User Calls Function to Access Data COS Grant Execute on SQL Cloud Function to User Configure SQL Pipelines SQL Query Cloud Functions User creates function sequence to automate flow of consecutive SQLs Sequence SQL Query Cloud Functions 1. 2. Use Cases of Cloud Functions Adding Value to SQL
  • 18. Ingredient 3: Serverless Data Transformation ✓ Ingredient 4: Serverless Analytics ✓ Ingredient 5: Serverless Automation ✓ Ingredient 2: Serverless Data Ingest ✓ Ingredient 1: Serverless Storage ✓ Now, what is this all good for?
  • 19. IBM Cloud Object Storage Acquire Query Data Warehouses & Databases Db2 on Cloud Process Analyze ApplicationsApplications Applications IoT Streaming Devices Devices Devices BI & AI Land Log Messages Cleanse Filter Merge Aggregate Compress Watson Studio Looker Cognos WML Explore Analyze Analyze Promote Use for Data Pipelines to fuel BI & AI
  • 20. Data –Driven Decisions ☛ Understanding system health, user behavior & workload status Collecting & Analyzing Log Data ☛ Is NOT and afterthought but rather foundation for decisions on system and feature design. Data Volume Growing Rapidly ☛ Growth rates and data volume at rest can jump dramatically. Very high elasticity is required. Competitive Advantage ☛ Is based on short runways for turning data into actions Turn your Logs into Business – Log Data Is The Cloud-Native Currency
  • 21. Logs Your Cloud Application/Solution IBM Cloud Object Storage Query Transform Compress Aggregate Repartition Analyze Anomaly Detection User Segmentation Customer Support Resource Planning • Build & run data pipelines and analytics of your log message data • Flexible log data analytics with full power of SQL • Seamless scalability & elasticity according to your log message volume Use for analyzing application logs
  • 22. IDUG Db2 Tech Conference Charlotte, NC | June 2 – 6, 2019 Data Lake in IBM Cloud – How it works IBM Cloud Data LakeData Streaming Upload ETL DB2 Feature Extraction Data Prep ICD DB2 ICD OLAP Analytics WML ETL Federate Asper a Cloudant Replication Secure Sync IBM Blockchain Application s Application s Watson Studio Knowledge Catalog METASTORE AI ICP for DataAnalytics Engine IBM Cloud Functions Land Process Integrate Key Protect Index Creation
  • 23. Getting started: https://www.ibm.com/cloud/sql-query SQL Query Intro Video: https://youtu.be/s-FznfHJpoU SQL Query Starter Notebook in Watson Studio: https://ibm.biz/BdYNrN SQL Reference: https://ibm.biz/Bd2jF7 SQL Query API doc: https://cloud.ibm.com/apidocs/sql-query Big Data Layout Best Practices for COS: https://ibm.biz/Bd2jRg Serverless Data & Analytics: https://ibm.biz/Bd2jF5 Further Resources
  • 25. 1. Identify friction points in users’ digital journey, e.g.: • Clicks-2-purchase ratio • Unexpected repeated page visits per user • E.g. entering payment data should only happen once • Last page visited per session 2. Identify click sequences for successful purchase • Sequence matching using timeseries analysis 3. Identify customers/segments likely to churn or expand • Look for typical page visits, actions or flows • E.g. Terms & conditions, invite additional users etc. 4. Determine your most important content online What Insights can I extract from a Clickstream?
  • 27. Building IBM Cloud-Native Data Lake Serverless SQL Serverless Storage Serverless Pipeline Automation ✓ ✓ ✓ Orchestration Processing Persistency Data Ingest ✓ Data Catalog ✓ Serverless Unstructured Data Processing ✓
  • 28. • Traditional analytics systems • Fixed capacities of appliances • Specialized teams of data engineers & DBAs who manage data model, access and ETL • BI analysts who have access only to the curated data sets in EDW • Innovative enterprises today • Wide range of teams that require direct access to same data set at all stages of the data pipeline: BI analysts, data scientists, quantitative marketers, dev/ops, developers • Data engineers that support these teams need a much, much more scalable and cost-effective platform to ensure all teams have the access they need, when they need it • Building analytics platforms in the cloud because of the scale and cost-efficiencies that come with serverless analytics over object stores Serverless – The key to IT Sharing Economy ... also for Analytics
  • 29. Proper data organization → better performance and lower cost The key factors are: • Number of bytes shipped • Number of REST requests Best practices for structured data: • Choose the right object size (sweet spot: 128 MB) • Choose the right format • Choose the right data layout • Avoid gzip compressed formats Applies to SQL Query but also applies to other Big Data engines To learn more: https://www.ibm.com/blogs/bluemix/2018/06/big-data-layout/
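A rough sketch of why object size matters for the "number of REST requests" factor. It assumes, purely for illustration, that reading a data set costs at least one GET request per object; the 1 TiB data set size and the 1 MiB comparison size are hypothetical, only the 128 MB sweet spot comes from the slide.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4

def min_get_requests(total_bytes, object_bytes):
    """Lower bound on GET requests, assuming one request per object (illustrative assumption)."""
    # Integer ceiling division: a partially filled last object still needs a request.
    return -(-total_bytes // object_bytes)

# Hypothetical 1 TiB data set stored as tiny objects vs. objects at the sweet spot.
tiny_objects = min_get_requests(TIB, 1 * MIB)
sweet_spot_objects = min_get_requests(TIB, 128 * MIB)

print(tiny_objects, sweet_spot_objects)
```

Under these assumptions the same data needs 128 times fewer requests at the sweet spot size, while the number of bytes shipped stays the same.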
  • 30. Which Format is Query-Friendly?
  • 31. Best Practice: minimize bytes scanned 1. Use Parquet • Column based • Only read the columns you need • Column wise compression • Min/max metadata 2. Use Hive style partitioning GPMeterStream/dt=2017-08-17/part-00085.csv GPMeterStream/dt=2017-08-17/part-00086.csv GPMeterStream/dt=2017-08-17/part-00087.csv GPMeterStream/dt=2017-08-17/part-00088.csv GPMeterStream/dt=2017-08-17/part-00089.csv GPMeterStream/dt=2017-08-18/part-00001.csv GPMeterStream/dt=2017-08-18/part-00002.csv GPMeterStream/dt=2017-08-18/part-00003.csv Avoid reading unnecessary objects altogether Technique has limitations
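A minimal sketch of how Hive style partitioning lets an engine avoid reading objects altogether. This is not SQL Query's actual implementation; the helper names are hypothetical, and the object keys are a subset of those shown on the slide.

```python
import re

# Object keys taken from the slide; the dt=<date> path segment is the partition.
OBJECT_KEYS = [
    "GPMeterStream/dt=2017-08-17/part-00085.csv",
    "GPMeterStream/dt=2017-08-17/part-00086.csv",
    "GPMeterStream/dt=2017-08-18/part-00001.csv",
    "GPMeterStream/dt=2017-08-18/part-00002.csv",
]

# Matches one <column>=<value> path segment.
PARTITION_SEGMENT = re.compile(r"([^/=]+)=([^/]+)")

def partition_values(key):
    """Return the partition columns encoded in an object key."""
    return dict(PARTITION_SEGMENT.findall(key))

def prune(keys, column, value):
    """Keep only the objects whose partition value satisfies column = value."""
    return [k for k in keys if partition_values(k).get(column) == value]

# A query with WHERE dt = '2017-08-18' only has to touch these objects.
selected = prune(OBJECT_KEYS, "dt", "2017-08-18")
print(selected)
```

The pruning decision uses only the object names, so no bytes of the skipped objects are ever requested. This also hints at the limitation noted on the slide: it only helps for predicates on the partition column.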
  • 32. Table Locators cos://<endpoint>/<bucket>/[<prefix>] <format definition> Endpoint – of your object storage bucket or a short alias E.g. s3.us-south.objectstorage.appdomain.cloud or alias us-south Bucket – name in object storage Prefix – one or multiple objects (i.e. table partitions) with same prefix Used in FROM clauses for input data and in target field for result set data Examples: cos://us-south/myBucket/myFolder/mySubFolder/myData.parquet cos://us/otherBucket/myData cos://us/otherBucket/myData/part cos://eu/newBucket/
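To make the locator structure concrete, here is a small sketch that splits a locator into endpoint, bucket and prefix and checks which object keys a prefix would select. The `TableLocator` type and both helper functions are hypothetical names for illustration, not part of any IBM library.

```python
from typing import NamedTuple

class TableLocator(NamedTuple):
    endpoint: str  # full endpoint host or a short alias such as "us-south"
    bucket: str
    prefix: str    # may be empty, meaning every object in the bucket

def parse_table_locator(locator):
    """Split cos://<endpoint>/<bucket>/[<prefix>] into its three parts."""
    scheme = "cos://"
    if not locator.startswith(scheme):
        raise ValueError("table locator must start with cos://")
    parts = locator[len(scheme):].split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError("table locator needs an endpoint and a bucket")
    prefix = parts[2] if len(parts) == 3 else ""
    return TableLocator(parts[0], parts[1], prefix)

def selects(locator, object_key):
    """True if the object key falls under the locator's prefix."""
    return object_key.startswith(locator.prefix)

loc = parse_table_locator("cos://us/otherBucket/myData/part")
print(loc)
print(selects(loc, "myData/part-00001.csv"))
```

Because matching is by prefix, a single locator can address one object or a whole set of table partitions.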
  • 33. <Table Locator> [JOBPREFIX JOBID | NONE] [STORED AS CSV | PARQUET | JSON] • Specifies the data format of the input data • Table schema is automatically inferred at SQL execution time • STORED AS clause is optional, the default is CSV • Additional parameters for CSV: • E.g.: FIELDS TERMINATED BY ‘t’ NOHEADER • JOBPREFIX only for targets: defines unique prefix to append. Default is JOBID. Table Format Definition
  • 34. SELECT … INTO <Table Locator> [STORED AS CSV | PARQUET | JSON] [PARTITIONED [BY (<column list>)] [INTO <num> BUCKETS] [EVERY <num> ROWS]] [SORT BY (<column list>)] BY: Produces Hive style partitioning INTO: Produces a fixed number of partitions (hash partitioned) EVERY: Produces partitions of even size (e.g. for pagination) SORT BY: Exact result order & clustering when combined with PARTITIONED Table Partitioning Definition
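The clause grammar above can be illustrated with a small string builder. This is a sketch only: the `into_clause` helper is hypothetical, the column names `dt` and `ts` are placeholders, and treating BY, INTO and EVERY as mutually exclusive is an assumption of this sketch rather than something the slide states.

```python
FORMATS = ("CSV", "PARQUET", "JSON")

def into_clause(target, stored_as=None, partitioned_by=None,
                buckets=None, every_rows=None, sort_by=None):
    """Assemble an INTO clause following the grammar shown on the slide."""
    parts = [f"INTO {target}"]
    if stored_as is not None:
        fmt = stored_as.upper()
        if fmt not in FORMATS:
            raise ValueError(f"unsupported format: {stored_as}")
        parts.append(f"STORED AS {fmt}")
    # Assumption of this sketch: only one partitioning style per statement.
    chosen = [o for o in (partitioned_by, buckets, every_rows) if o is not None]
    if len(chosen) > 1:
        raise ValueError("choose one of partitioned_by, buckets, every_rows")
    if partitioned_by is not None:
        parts.append(f"PARTITIONED BY ({', '.join(partitioned_by)})")
    elif buckets is not None:
        parts.append(f"PARTITIONED INTO {buckets} BUCKETS")
    elif every_rows is not None:
        parts.append(f"PARTITIONED EVERY {every_rows} ROWS")
    if sort_by is not None:
        parts.append(f"SORT BY ({', '.join(sort_by)})")
    return " ".join(parts)

hive_style = into_clause("cos://us/otherBucket/myData",
                         stored_as="parquet", partitioned_by=["dt"])
paged = into_clause("cos://us/otherBucket/myData",
                    every_rows=1000, sort_by=["ts"])
print(hive_style)
print(paged)
```

The first call corresponds to the Hive style layout, the second to evenly sized result partitions that an application could page through.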
  • 35. Submit a SQL query POST https://api.sql-query.cloud.ibm.com/v2/sql_jobs Runs the SQL in the background and returns a job_id Detailed info for a SQL query (e.g. status, result location) GET https://api.sql-query.cloud.ibm.com/v2/sql_jobs/{job_id} Returns JSON with query execution details List of recent SQL query executions GET https://api.sql-query.cloud.ibm.com/v2/sql_jobs Returns JSON array with last 30 SQL submissions and outcomes IBM SQL Query REST API
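The three calls could be prepared from Python roughly as follows. This sketch only builds the request objects and does not send them. The endpoint paths come from the slide; the `statement` body field, the bearer-token header and the helper names are assumptions made for illustration, and any further required parameters (such as identifying the service instance) are omitted, so check the SQL Query API doc before relying on them.

```python
import json
import urllib.request

BASE_URL = "https://api.sql-query.cloud.ibm.com/v2/sql_jobs"

def _headers(token):
    # Assumption: the API accepts an IAM bearer token in the Authorization header.
    return {"Authorization": f"Bearer {token}", "Accept": "application/json"}

def submit_request(statement, token):
    """POST /v2/sql_jobs: runs the SQL in the background, returns a job_id."""
    # Assumption: the SQL text is sent in a JSON field named "statement".
    body = json.dumps({"statement": statement}).encode("utf-8")
    headers = dict(_headers(token), **{"Content-Type": "application/json"})
    return urllib.request.Request(BASE_URL, data=body, headers=headers, method="POST")

def job_details_request(job_id, token):
    """GET /v2/sql_jobs/{job_id}: status and result location of one job."""
    return urllib.request.Request(f"{BASE_URL}/{job_id}", headers=_headers(token), method="GET")

def list_jobs_request(token):
    """GET /v2/sql_jobs: the most recent submissions and their outcomes."""
    return urllib.request.Request(BASE_URL, headers=_headers(token), method="GET")

req = submit_request("SELECT * FROM cos://us/otherBucket/myData LIMIT 10", "<token>")
print(req.get_method(), req.full_url)
```

Since submission is asynchronous, a client would typically send the POST, then poll the job details endpoint until the status indicates completion, and finally read the result set from the reported location in object storage.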
  • 36. Scaling Analytics: Data Skipping Saving you Time and $ Index All Objects IBM Cloud Object Storage Data Set Objects SQL Query Data Skipping Indexing Candidate Objects WHERE Clause Saving Time and $ SQL Query learns which objects are not relevant to a query using a data skipping index CREATE METAINDEX stores index summary metadata for each object. Much smaller than the data. SQL queries skip irrelevant objects to significantly reduce I/O Independent of data formats Index Types: Min/Max, Value List, Bounding Box E.g.: Get location and time of heat waves (>40 Celsius) SELECT lat, long, city, temp, date FROM weather WHERE temp > 40.0
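A simplified sketch of the Min/Max index type applied to the heat wave query. It is not how SQL Query implements data skipping; the object names and temperature values are made up, and the helper names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ObjectSummary:
    key: str
    min_temp: float
    max_temp: float

def build_min_max_index(objects):
    """Summarize each object by the min and max of its temp column."""
    return [ObjectSummary(key, min(temps), max(temps)) for key, temps in objects.items()]

def candidates_for_greater_than(index, threshold):
    """Objects that may contain rows with temp > threshold; all others are skipped."""
    return [s.key for s in index if s.max_temp > threshold]

# Hypothetical weather objects with the temp values they contain.
weather_objects = {
    "weather/part-0001.parquet": [12.5, 18.0, 21.3],
    "weather/part-0002.parquet": [35.1, 41.7, 39.9],
    "weather/part-0003.parquet": [-4.0, 3.2, 9.8],
}

index = build_min_max_index(weather_objects)
to_scan = candidates_for_greater_than(index, 40.0)  # WHERE temp > 40.0
print(to_scan)
```

The index holds two numbers per object regardless of object size, which is why the summary metadata is much smaller than the data. Only candidate objects are read, and the WHERE clause is still evaluated on their rows.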
  • 38. • JDBC compliant driver library that wraps REST API • Wrapping both the SQL Query and COS REST APIs • Exposing regular session interface (JDBC Connection) • Enabling custom JDBC application support • Enabling BI application support • Early adopter: Looker • Support for stored table metadata (simple catalog) • Stored as JSON in COS and referenced via JDBC connection string • I.e. DatabaseMetaData interface also supported JDBC Driver for BI Applications Apply for Beta Now Query JDBC Driver REST COS JDBC API Data Result Sets Table Catalog E.g. Looker
  • 39. Using SQL Query JDBC Driver Define table catalog • JSON file in COS containing: • Table name • Location of table objects on COS • Object format • Column names • Column types • INT, FLOAT, VARCHAR, TIMESTAMP JDBC Connection String: jdbc:SQLQuery:<sql-query instance crn> ?schemabucket=<COS bucket with json catalog> ?schemafile=<COS object with json catalog> &apikey=<api key for your account> &targetcosurl=<COS URL for result set>
  • 40. Think 2019 / 2263 / February 2019 / © 2019 IBM Corporation
  • 41. IBM Cloud Functions Fair Never pay for idle Polyglot Elastic Automation Triggers Open Source CLOUD FUNCTIONS Schedules Sequences