Some Iceberg Basics for Beginners (CDP)
How to Use: First Steps
© 2022 Cloudera, Inc. All rights reserved. 2
Recommended Iceberg Workflow
1. Create Iceberg tables (CDW: Hive; CDE: Spark SQL)
   a. Bring your own datasets by converting your Hive external tables, OR
   b. Use the sample airline datasets
2. Batch insert data (CDE: Spark SQL)
   To prepare the Time Travel scenario, insert more data into Iceberg tables with Hive or Spark.
3. Create security policy (SDX: Ranger)
   Create a Ranger policy to mask a column for Fine Grained Access Control (FGAC).
4. Build BI query (CDW: Impala SQL)
   Create SQL queries for standard ops reporting.
5. Build visualizations (CDV: create data set from query and build visuals)
   Create data sets and visuals from the query.
6. Perform Time Travel (CDW: Hive/Impala SQL; CDE: Spark Scala API)
   Create Time Travel queries and execute them to audit what has changed.
7. Partition evolution (CDW: Hive/Impala SQL; CDE: Spark SQL)
   Optimize the partition schema to improve query performance.
8. Table maintenance (CDE: Spark SQL)
   Manage / expire snapshots.
Workflow stages: CREATE, INGEST / PREP, SERVE, OPERATION / MAINTENANCE, GOVERN
SQL Commands (Hive, Spark, Impala)
SQL Commands
[Figure: Iceberg Tables at the center, surrounded by five command groups: Table Conversion, Time Travel, DDL, Query, DML]
Ease of Use through consistent SQL Syntax across compute engines
A rich set of SQL commands has been developed
for Hive, Impala and Spark to:
• Create and manipulate database objects
• Run Queries
• Load data into tables
• Modify data in tables
• Perform Time Travel operations
• Convert to Iceberg tables
Snapshot of Iceberg SQL Commands
Command                                              Hive   Impala   Spark
Select                                               ⬤      ⬤        ⬤
DML (INSERT INTO, INSERT OVERWRITE)                  ⬤      ⬤        ⬤
Create Table                                         ⬤      ⬤        ⬤
Alter Table                                          ⬤      ⬤        ⬤
Drop Table                                           ⬤      ⬤        ⬤
Truncate Table                                       ⬤      ⬤        NA
Create-Table-As-Select                               ⬤      ⬤        ⬤
Replace-Table-As-Select                              NA     NA       ⬤
Partition Evolution                                  ⬤      ⬤        ⬤
Partition Transformation                             ⬤      ⬤        ⬤
Schema Evolution                                     ⬤      ⬤        ⬤
Table Metadata (DESCRIBE TABLE, SHOW CREATE TABLE)   ⬤      ⬤        ⬤
Time Travel                                          ⬤      ⬤        Scala API now, SQL is planned
Table Migration                                      ⬤      NA       ⬤
Table Maintenance                                    NA     NA       ⬤
Legend: ⬤ = supported, color-coded on the slide as either General Availability or Tech Preview; NA = not available
Compute Engines Interoperability & Fine Grained Access Control
Compute Engine Interoperability & FGAC
❏ Consistent Iceberg table access and processing with SQL using Hive, Spark and Impala (reads and writes)
    ❏ No partial reads
    ❏ No adapters needed
❏ Iceberg FGAC support through Ranger integration with Hive / Impala
    ❏ Spark is planned
❏ Compatible with existing workflows
❏ Optimized for performance, cost and developer efficiency
[Figure: compute engines, including Apache Impala, reading and writing shared Iceberg Tables]
Table Conversion SQL commands / Utility [Tech Preview]
Table Conversion from Hive External to Iceberg Tables
1. Hive table migration (converts the table in place):
ALTER TABLE tbl SET TBLPROPERTIES
('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler')
2. Spark 3:
a. Import Hive tables into Iceberg (creates a new Iceberg table <dest> over the source data files; <src> is left unchanged)
spark.sql("CALL <catalog>.system.snapshot('<src>', '<dest>')")
b. Migrate Hive tables to Iceberg tables (replaces <src> with an Iceberg table in place)
spark.sql("CALL <catalog>.system.migrate('<src>')")
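The three conversion paths above differ only in the statement that is issued. As a minimal sketch, the Python helpers below assemble those statements as plain strings with ASCII quotes, so they can be pasted or passed to whatever client is in use. The function names and the example identifiers (`db.tbl`, `spark_catalog`, `db.tbl_iceberg`) are hypothetical, and actually running the statements against Hive or Spark is left out.

```python
# Storage handler class name taken from the slide above.
ICEBERG_STORAGE_HANDLER = "org.apache.iceberg.mr.hive.HiveIcebergStorageHandler"


def hive_migrate_sql(table: str) -> str:
    """Hive statement that switches an external table to the Iceberg storage handler."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('storage_handler'='{ICEBERG_STORAGE_HANDLER}')"
    )


def spark_snapshot_sql(catalog: str, src: str, dest: str) -> str:
    """Spark 3 procedure call that imports src into a new Iceberg table dest."""
    return f"CALL {catalog}.system.snapshot('{src}', '{dest}')"


def spark_migrate_sql(catalog: str, src: str) -> str:
    """Spark 3 procedure call that migrates src to an Iceberg table."""
    return f"CALL {catalog}.system.migrate('{src}')"


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    print(hive_migrate_sql("db.tbl"))
    print(spark_snapshot_sql("spark_catalog", "db.tbl", "db.tbl_iceberg"))
    print(spark_migrate_sql("spark_catalog", "db.tbl"))
```

Keeping the statements in one place like this makes it easier to try the snapshot path on a copy first and only then run the in-place migration.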
Time Travel Operations
Time Travel
Time Travel is the ability to make a query reproducible at a given snapshot and/or time.
Time Travel operations:
● SELECT … AS OF …
Standard SQL operations:
● Queries
● DDL
● DML
[Figure: timeline t starting at T0, with snapshots from Snapshot A to Snapshot Z; Apache Impala logo]
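One way to picture how an AS OF query by time picks its data: the table keeps a log of snapshots with commit times, and the query reads the latest snapshot committed at or before the requested time. The Python sketch below models only that lookup. It is a simplified illustration, not the engines' implementation; the snapshot IDs and timestamps are made up, and returning `None` when no snapshot is old enough is an assumption of this sketch.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Each entry is (commit time in epoch milliseconds, snapshot id),
# sorted by commit time. Values used below are hypothetical.
SnapshotLog = List[Tuple[int, int]]


def snapshot_as_of(log: SnapshotLog, as_of_ms: int) -> Optional[int]:
    """Return the id of the latest snapshot committed at or before as_of_ms."""
    commit_times = [ts for ts, _ in log]
    # Index of the first commit strictly after the requested time.
    idx = bisect_right(commit_times, as_of_ms)
    if idx == 0:
        return None  # the table had no snapshot yet at that time
    return log[idx - 1][1]


if __name__ == "__main__":
    log = [(1_000, 101), (2_000, 102), (3_000, 103)]
    print(snapshot_as_of(log, 2_500))  # between the second and third commit
    print(snapshot_as_of(log, 500))    # before the first commit
```

Because every snapshot is immutable, re-running the same query with the same time or snapshot ID reads the same data, which is what makes the query reproducible.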
Time Travel Operations
Time Travel Ops: SQL Examples
Hive / Impala query:
SELECT * FROM table FOR SYSTEM_TIME AS OF '2021-08-09 10:35:57';
SELECT * FROM table FOR SYSTEM_VERSION AS OF 1234567;
Spark Scala API:
// time travel to snapshot with ID 10963874102873L
spark.read
    .option("snapshot-id", 10963874102873L)
    .format("iceberg")
    .load("path/to/table")
// time travel to October 26, 1985 at 08:21:00 UTC (as-of-timestamp is in epoch milliseconds)
spark.read
    .option("as-of-timestamp", "499162860000")
    .format("iceberg")
    .load("path/to/table")
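The `as-of-timestamp` option takes milliseconds since the Unix epoch, which is easy to get wrong by hand. The sketch below converts in both directions using only the Python standard library; the helper names are hypothetical. Decoding the value 499162860000 used above gives 1985-10-26 08:21:00 UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    if dt.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime to avoid local-time surprises")
    # Integer division of timedeltas avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(ms: int) -> datetime:
    """Convert milliseconds since the Unix epoch to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


if __name__ == "__main__":
    print(from_epoch_millis(499162860000).isoformat())
    print(to_epoch_millis(datetime(2021, 8, 9, 10, 35, 57, tzinfo=timezone.utc)))
```

Requiring a timezone-aware input is a deliberate choice here: a naive datetime would silently be read in the machine's local zone and could shift the query to a different snapshot.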
Partition Evolution
In-place Partition Evolution
❏ Existing big data solutions don't support in-place partition evolution. The entire table must be completely rewritten with the new partition column.
❏ With Iceberg's hidden partitioning, which separates the physical layout from the logical table, users are not required to maintain partition columns.
❏ Iceberg tables can evolve partition schemas over time as data volume changes.
❏ Benefits:
    ❏ No costly table rewrites or table migration
    ❏ No query rewrites
    ❏ Reduce downtime and improve SLA
[Figure: timeline t. Data before 2022-01-01 is partitioned by Month(date) (2021-10-01, 2021-11-01, 2021-12-01); data from 2022-01-01 onward is partitioned by Day(date) (days 1 to 31). The partitions included in the query plan are highlighted, with Split plan 1 over the monthly partitions and Split plan 2 over the daily partitions.]
SELECT * FROM SALES_ORDER
WHERE
DATE > '2021-11-23' AND
DATE < '2022-01-19'
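To make the split-plan picture concrete, the sketch below works out which partitions the example query on SALES_ORDER would touch when data before 2022-01-01 is laid out by month and later data by day. This is a simplified model of the idea, not Iceberg's planner: the switch-over date is read off the slide, the partition labels are illustrative, and the function names are hypothetical.

```python
from datetime import date, timedelta
from typing import List, Tuple

# Date at which the table switched from monthly to daily partitions (from the slide).
EVOLVED_ON = date(2022, 1, 1)


def month_partitions(lo: date, hi: date) -> List[str]:
    """Monthly partitions (YYYY-MM) overlapping the inclusive range [lo, hi]."""
    out = []
    year, month = lo.year, lo.month
    while (year, month) <= (hi.year, hi.month):
        out.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return out


def day_partitions(lo: date, hi: date) -> List[str]:
    """Daily partitions (YYYY-MM-DD) in the inclusive range [lo, hi]."""
    return [(lo + timedelta(days=i)).isoformat() for i in range((hi - lo).days + 1)]


def plan(after: date, before: date) -> Tuple[List[str], List[str]]:
    """Partitions to scan for: after < DATE < before (both bounds exclusive)."""
    lo, hi = after + timedelta(days=1), before - timedelta(days=1)
    if lo > hi:
        return [], []
    last_monthly_day = EVOLVED_ON - timedelta(days=1)
    # Split plan 1: the part of the range stored under the old monthly spec.
    monthly = month_partitions(lo, min(hi, last_monthly_day)) if lo <= last_monthly_day else []
    # Split plan 2: the part of the range stored under the new daily spec.
    daily = day_partitions(max(lo, EVOLVED_ON), hi) if hi >= EVOLVED_ON else []
    return monthly, daily


if __name__ == "__main__":
    monthly, daily = plan(date(2021, 11, 23), date(2022, 1, 19))
    print("split plan 1 (monthly):", monthly)
    print("split plan 2 (daily):", daily[0], "to", daily[-1], f"({len(daily)} partitions)")
```

The query text never mentions either layout. The same WHERE clause on the date column is planned against each spec separately, which is why no query rewrite is needed after the partition change.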
Partition Evolution SQL examples
Hive / Impala:
-- Partition evolution to hour
ALTER TABLE t SET PARTITION SPEC (hour(ts))
Spark SQL:
-- Partition evolution to hour
ALTER TABLE t ADD PARTITION FIELD (hour(ts))
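With a transform such as hour(ts), the partition value is derived from the ts column rather than stored in a separate column that users must fill in. As I understand the Iceberg table spec, the hour and day transforms count whole hours and days since the Unix epoch; the sketch below computes those values in Python for illustration. The function names and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def hour_partition(ts: datetime) -> int:
    """Whole hours elapsed since the Unix epoch (ts must be timezone-aware)."""
    return (ts - EPOCH) // timedelta(hours=1)


def day_partition(ts: datetime) -> int:
    """Whole days elapsed since the Unix epoch (ts must be timezone-aware)."""
    return (ts - EPOCH) // timedelta(days=1)


if __name__ == "__main__":
    a = datetime(2021, 8, 9, 10, 5, 0, tzinfo=timezone.utc)
    b = datetime(2021, 8, 9, 10, 35, 57, tzinfo=timezone.utc)
    c = datetime(2021, 8, 9, 11, 0, 0, tzinfo=timezone.utc)
    # a and b fall in the same hour partition; c falls in the next one.
    print(hour_partition(a), hour_partition(b), hour_partition(c))
    # All three share one day partition.
    print(day_partition(a), day_partition(b), day_partition(c))
```

Rows written after the change land in hourly partitions, while rows written earlier keep the layout they were written with.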
Table Maintenance [ Tech Preview ]
Table Maintenance [Tech Preview]
Table Maintenance Ops: Examples
Hive / Impala query:
-- Tentative, proposed syntax, not in GA
-- Expires snapshots that are older than 7 days.
ALTER TABLE test_table EXECUTE expire_snapshots_lt (now() - interval 7 days);
Iceberg Java API (from Spark):
// Not in GA
// Expires snapshots that are older than 7 days.
Table test_table = …
long tsToExpire = System.currentTimeMillis() - (1000*60*60*24*7);
test_table.expireSnapshots()
    .expireOlderThan(tsToExpire)
    .commit();
Expiring old snapshots removes them from metadata, so they are no longer available for time travel operations. Data files are
not deleted until they are no longer referenced by a snapshot that may be used for time travel. Regularly expiring snapshots
deletes unused data files.
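The relationship between expired snapshots and deletable data files can be sketched with a small in-memory model. This is only an illustration of the rule described above, not Iceberg's implementation: the `Snapshot` class, file names, and timestamps are hypothetical, and always retaining the most recent snapshot is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple

DAY_MS = 24 * 60 * 60 * 1000


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_ms: int            # commit time in epoch milliseconds
    data_files: FrozenSet[str]   # data files reachable from this snapshot


def expire_older_than(
    snapshots: List[Snapshot], cutoff_ms: int
) -> Tuple[List[Snapshot], Set[str]]:
    """Return (retained snapshots, data files no retained snapshot references)."""
    ordered = sorted(snapshots, key=lambda s: s.committed_ms)
    # Keep snapshots at or after the cutoff, plus the current (latest) one.
    retained = [s for s in ordered[:-1] if s.committed_ms >= cutoff_ms] + ordered[-1:]
    expired = [s for s in ordered if s not in retained]
    still_referenced = set().union(*(s.data_files for s in retained))
    # A file is deletable only if every snapshot that referenced it has expired.
    deletable = set().union(*(s.data_files for s in expired)) - still_referenced
    return retained, deletable


if __name__ == "__main__":
    now_ms = 10 * DAY_MS  # hypothetical "current time"
    snapshots = [
        Snapshot(1, 1 * DAY_MS, frozenset({"a.parquet", "b.parquet"})),
        Snapshot(2, 2 * DAY_MS, frozenset({"b.parquet", "c.parquet"})),
        Snapshot(3, 9 * DAY_MS, frozenset({"c.parquet", "d.parquet"})),
    ]
    retained, deletable = expire_older_than(snapshots, now_ms - 7 * DAY_MS)
    print([s.snapshot_id for s in retained], sorted(deletable))
```

In the example, `c.parquet` survives even though an expired snapshot referenced it, because the retained snapshot still needs it.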
