Apache CarbonData:
An indexed columnar file format for
interactive query with Spark SQL
Jihong MA, Jacky LI
HUAWEI
Data use cases: Report & Dashboard, OLAP & Ad-hoc, Batch Processing, Machine Learning, Real-Time Analytics
Challenge
• Wide spectrum of query analysis
• OLAP vs. detailed query
• Full scan vs. small scan
• Point queries
Query spectrum: big scan queries, small scan queries, multi-dimensional OLAP queries
How to choose a storage engine that facilitates query execution?
Available Options
1. NoSQL Database
• Key-value store: low latency, < 5 ms
• No standard SQL support
2. MPP Relational Database
• Shared-nothing architecture enables fast query execution
• Poor scalability: cluster size < 100 nodes, no fault tolerance
3. Search Engine
• Advanced indexing techniques for fast search
• 3~4X data expansion in size, no SQL support
4. SQL on Hadoop
• Modern distributed architecture and high scalability
• Slow on point queries
Motivation for a New File Format
CarbonData: a unified file format.
A single copy of data, balanced to serve all data access patterns: big scan queries, small scan queries, and multi-dimensional OLAP queries.
Apache CarbonData
• Apache Incubator project since June 2016
• Apache releases:
• 4 stable releases
• Latest: 1.0.0, Jan 28, 2017
• Contributors and production users: (logos shown on slide, spanning compute and storage vendors)
Introducing CarbonData
1. CarbonData File: what is the CarbonData file format?
2. CarbonData Table: what forms a CarbonData table on disk?
3. CarbonData Integration with Spark: what does it take to deeply integrate with a distributed processing engine like Spark?
CarbonData:
An Indexed Columnar File Format
CarbonData File Structure
• Built-in columnar layout & index
• Multi-dimensional index (B+ tree)
• Min/Max index
• Inverted index
• Encoding:
• RLE, Delta, Global Dictionary
• Snappy compression
• Adaptive data-type compression
• Data types:
• Primitive types and nested types
CarbonData File Layout
• Blocklet: a set of rows in columnar format
• Data is sorted along the MDK (multi-dimensional key)
• Clustered data enables efficient filtering and scans
• Column chunk: data for one column in a blocklet
• Footer: metadata information
• File-level metadata & statistics
• Schema
• Blocklet index
Carbon data file layout:
Blocklet 1: Column 1 Chunk, Column 2 Chunk, …, Column n Chunk
…
Blocklet N
File Footer: File Metadata, Schema, Blocklet Index
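To make the layout concrete, here is a minimal sketch of these structures as Scala case classes. This is an illustrative model only; the field names and types are assumptions, not CarbonData's actual on-disk (Thrift) definitions:

// Illustrative model of the CarbonData file layout (not the real on-disk schema)
case class ColumnChunk(column: String, compressedBytes: Array[Byte])

case class Blocklet(chunks: Seq[ColumnChunk]) // one chunk per column, same row set

case class BlockletIndexEntry(
  startKey: Seq[Int],                 // first MDK in the blocklet
  endKey: Seq[Int],                   // last MDK in the blocklet
  minMax: Map[String, (Long, Long)])  // per-column min/max statistics

case class FileFooter(
  metadata: Map[String, String],          // file-level metadata & statistics
  schema: Seq[(String, String)],          // column name -> data type
  blockletIndex: Seq[BlockletIndexEntry]) // one entry per blocklet

case class CarbonFile(blocklets: Seq[Blocklet], footer: FileFooter)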
File Level Blocklet Index
Example: rows sorted along the MDK and split into blocklets (columns C1~C7):
Blocklet 1:
1 1 1 1 1 1 12000
1 1 1 2 1 2 5000
1 1 2 1 1 1 12000
1 1 2 2 1 2 5000
1 1 3 1 1 1 12000
1 1 3 2 1 2 5000
Blocklet 2:
1 2 1 3 2 3 11000
1 2 2 3 2 3 11000
1 2 3 3 2 3 11000
1 3 1 4 3 4 2000
1 3 1 5 3 4 1000
1 3 2 4 3 4 2000
Blocklet 3:
1 3 2 5 3 4 1000
1 3 3 4 3 4 2000
1 3 3 5 3 4 1000
1 4 1 4 1 1 20000
1 4 2 4 1 1 20000
1 4 3 4 1 1 20000
Blocklet 4:
2 1 1 1 1 1 12000
2 1 1 2 1 2 5000
2 1 2 1 1 1 12000
2 1 2 2 1 2 5000
2 1 3 1 1 1 12000
2 1 3 2 1 2 5000
Blocklet Index (stored in the file footer): one entry per blocklet, holding its start key, end key, and per-column statistics C1(Min,Max) … C7(Min,Max); e.g., Blocklet 1 → (Start Key1, End Key1), …, Blocklet 4 → (Start Key4, End Key4). Entries are organized as a tree whose inner nodes cover key ranges (Start Key1 to End Key4 at the root).
• Build an in-memory, file-level MDK index tree for filtering
• Major optimization for efficient scans
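A sketch of how such an index entry can be derived from MDK-sorted rows, using the slide's integer rows (the real format stores keys and statistics in binary form):

// One index entry per blocklet: start key, end key, per-column (min, max).
// Rows are already sorted by MDK, so head/last give the key range directly.
def buildIndexEntry(blocklet: Seq[Seq[Int]]): (Seq[Int], Seq[Int], Seq[(Int, Int)]) = {
  val startKey = blocklet.head
  val endKey = blocklet.last
  val minMax = blocklet.transpose.map(col => (col.min, col.max))
  (startKey, endKey, minMax)
}

val blocklet2 = Seq(
  Seq(1, 2, 1, 3, 2, 3, 11000),
  Seq(1, 2, 2, 3, 2, 3, 11000),
  Seq(1, 2, 3, 3, 2, 3, 11000),
  Seq(1, 3, 1, 4, 3, 4, 2000),
  Seq(1, 3, 1, 5, 3, 4, 1000),
  Seq(1, 3, 2, 4, 3, 4, 2000))
val (start, end, stats) = buildIndexEntry(blocklet2)
// stats(6) == (1000, 11000): the (min, max) of C7 in Blocklet 2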
Rich Multi-Level Index Support
Architecture: the Spark driver holds the table-level index; each executor holds a file-level index & scanner over Carbon files (data + footer); Catalyst drives the query, and inverted indexes accelerate scans within column chunks.
• Using the index info in the footer, a two-level B+ tree index can be built:
• File-level index: a local B+ tree for efficient blocklet-level filtering
• Table-level index: a global B+ tree for efficient file-level filtering
• Column chunk inverted index: for efficient column chunk scans
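A minimal sketch of the resulting two-level pruning, with flat scans standing in for the B+ trees (structures and names here are illustrative):

// Level 1 runs on the driver (prune whole files); level 2 runs on executors
// (prune blocklets inside surviving files). Real CarbonData walks B+ trees.
case class BlockletMeta(minMax: Map[String, (Long, Long)])
case class FileMeta(path: String, minMax: Map[String, (Long, Long)], blocklets: Seq[BlockletMeta])

// An equality filter col = v can only match where v lies inside min..max.
def mayMatch(stats: Map[String, (Long, Long)], col: String, v: Long): Boolean =
  stats.get(col).forall { case (min, max) => v >= min && v <= max }

def prune(files: Seq[FileMeta], col: String, v: Long): Seq[(String, Seq[BlockletMeta])] =
  files.filter(f => mayMatch(f.minMax, col, v))                                  // file-level filtering
       .map(f => f.path -> f.blocklets.filter(b => mayMatch(b.minMax, col, v)))  // blocklet-level filtering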
CarbonData Table: Data Segments + Metadata
A CarbonData table consists of Carbon files plus:
• Metadata: appendable dictionary and schema
• Index: a separate index per segment
• Data: immutable data files
Loading from a data source in batch produces a segment; queries read across the table's segments.
Each batch load creates a new segment (Segment 2 next to Segment 1): the dictionary is appended, a new index tree is built, and data files land in a new folder. Reads see all committed segments. A load semantics sketch follows.
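The load semantics can be sketched with a simplified in-memory model: each load appends an immutable segment with its own index, while the dictionary only ever grows (all names illustrative):

case class Segment(id: Int, dataFiles: Seq[String], indexFiles: Seq[String])
case class Table(segments: Vector[Segment], dictionary: Vector[String])

// A batch load never rewrites existing segments: it appends a new one,
// builds a new index tree for it, and appends new values to the dictionary.
def loadBatch(t: Table, files: Seq[String], newDictValues: Seq[String]): Table =
  Table(
    segments = t.segments :+ Segment(t.segments.size + 1, files, files.map(_ + ".index")),
    dictionary = t.dictionary ++ newDictValues.filterNot(t.dictionary.contains))

// Readers see the union of all committed segments.
def visibleFiles(t: Table): Seq[String] = t.segments.flatMap(_.dataFiles)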
CarbonData Table Layout
On HDFS:
• /table_name/fact/segment_id: Carbon files (data + footer)
• /table_name/meta: dictionary file (dictionary map), schema file (latest schema), index file (all footers)
On the Spark driver: the table-level index.
Mutation: Delete
A delete writes a delete delta alongside the base file in the segment: a bitmap file that marks deleted rows. No change to the index, no change to the dictionary.
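A sketch of how a delete delta could be applied at read time, with a plain Set of row ids standing in for the bitmap:

// Read-side merge of a base file with its delete delta (illustrative).
// A real delete delta is a compact bitmap, not a Set.
def readWithDeleteDelta(
    baseRows: Iterator[(Long, String)],   // (rowId, row) pairs from the base file
    deleted: Set[Long]): Iterator[String] =
  baseRows.collect { case (rowId, row) if !deleted(rowId) => row }

val base = Iterator((0L, "a"), (1L, "b"), (2L, "c"), (3L, "d"))
readWithDeleteDelta(base, deleted = Set(1L, 3L)).toList  // List("a", "c")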
Mutation: Update
An update writes a delete delta (a bitmap file, as above) plus an insert delta (a regular data file for the inserted rows). A new index is added for the insert delta, and the dictionary is appended.
Data Compaction
Carbon files from existing segments (Segment 1, Segment 2) are merged into a compacted segment (Segment 1.1) with merged data files, index, and dictionary; reads then go to the compacted segment.
Segment Management
A ZooKeeper-based Segment Manager tracks segment state, coordinating concurrent loads (writes) into new data segments with queries (reads) over committed ones.
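A sketch of the segment lifecycle such a manager has to serialize; the state names are illustrative, not CarbonData's exact status values:

// A ZooKeeper-backed manager persists one state per segment so that
// concurrent loads and queries agree on which segments are visible.
sealed trait SegmentState
case object InProgress extends SegmentState      // currently being written by a load
case object Committed extends SegmentState       // load finished, visible to queries
case object MarkedForDelete extends SegmentState // replaced by compaction or dropped

def visibleToQueries(s: SegmentState): Boolean = s == Committed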
SparkSQL + CarbonData:
Enables fast interactive data analysis
Carbon-Spark Integration
• Built-in Spark integration
• Spark 1.5, 1.6, 2.1
• Interfaces
• SQL
• DataFrame API
• Operations
• Load, Query (with optimization)
• Update, Delete, Compaction, etc.
Integration layers on top of Carbon files: Reader/Writer, Data Management, Query Optimization.
Integration through File Format
Carbon files on HDFS, organized in segments (Segment 1, 2, 3), are accessible to any engine through the provided InputFormat/OutputFormat.
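From Spark, format-level access reduces to the generic Hadoop file APIs; a sketch follows, where the concrete CarbonData InputFormat class is left as a type parameter because its name and signature vary by version:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{InputFormat => NewInputFormat}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Reads any Hadoop-style format into an RDD; pass CarbonData's InputFormat
// implementation (shipped with the project) as fClass to read Carbon files.
def readViaInputFormat[K, V, F <: NewInputFormat[K, V]](
    spark: SparkSession, path: String,
    fClass: Class[F], kClass: Class[K], vClass: Class[V]): RDD[(K, V)] =
  spark.sparkContext.newAPIHadoopFile(path, fClass, kClass, vClass, new Configuration())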
Deep Integration with SparkSQL
Beyond the file format, CarbonData integrates with the SparkSQL optimizer, turning the same Carbon files on HDFS (Segments 1-3) into a CarbonData table with full read/write/update/delete support.
CarbonData as a SparkSQL Data Source
Within the SparkSQL Catalyst framework, on Spark Core:
• Parser/Analyzer: accepts SQL or DataFrame input; a parser extension adds new SQL syntax for DML-related statements; ResolveRelation binds the Carbon data source
• Optimizer (rule-based + cost-based): a Carbon-specific optimization rule implements Lazy Decode, leveraging the global dictionary
• Physical Planning: a planning strategy for the Carbon data source
• Execution: CarbonScanRDD leverages the multi-level index for efficient filtering and scans, plus DML-related RDDs
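For flavor, this is how an extra rule-based optimization can be registered with Catalyst in Spark 2.x; the no-op rule below is only a placeholder for where a rule like CarbonData's Lazy Decode plugs in:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A real Lazy Decode rule would rewrite aggregations to run on
// dictionary-encoded values and append a decode step after them.
object LazyDecodePlaceholder extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan  // no-op sketch
}

val spark = SparkSession.builder().appName("rule-demo").master("local[*]").getOrCreate()
spark.experimental.extraOptimizations = Seq(LazyDecodePlaceholder)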
Efficient Filtering via Index
Example: SELECT c3, c4 FROM t1 WHERE c2='boston'
1. File pruning: the Spark driver prunes files via the table-level index and schedules tasks only for surviving files
2. Blocklet pruning: each Spark executor prunes blocklets within its files via the footer index
3. Read and decompress the filter column (c2)
4. Binary search using the inverted index; skip to the next blocklet if there is no match
5. Decompress the projection columns (c3, c4)
6. Return decoded data to Spark
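Step 4 works because the inverted index keeps each column chunk's (encoded value, row id) pairs sorted by value; a self-contained sketch of that lookup:

// Simplified inverted index for one column chunk: entries sorted by encoded value.
val invertedIndex: Vector[(Int, Int)] =            // (encodedValue, rowId)
  Vector((3, 4), (5, 0), (5, 2), (7, 1), (9, 3))

// Binary search for the lower bound of `target`, then collect equal entries.
def lookup(target: Int): Seq[Int] = {
  var lo = 0
  var hi = invertedIndex.length
  while (lo < hi) {
    val mid = (lo + hi) / 2
    if (invertedIndex(mid)._1 < target) lo = mid + 1 else hi = mid
  }
  invertedIndex.drop(lo).takeWhile(_._1 == target).map(_._2)
}

lookup(5)  // Vector(0, 2): rows 0 and 2 match the filter value
lookup(4)  // Vector():    no match, so the scanner skips to the next blocklet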
Lazy Decoding by Leveraging the Global Dictionary
Example: SELECT c3, sum(c2) FROM t1 WHERE c1>10 GROUP BY c3
Before applying Lazy Decode:
• Stage 1: scan with filter (decode c1, c2, c3), then partial aggregation
• Stage 2: final aggregation
After applying Lazy Decode:
• Stage 1: scan with filter (decode c1, c2 only), then partial aggregation on the encoded value of c3
• Stage 2: final aggregation, followed by a Dictionary Decode step that turns c3 back into its string type
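A toy version of the rewritten plan shows the gain: aggregation runs on cheap integer codes, and only the final result groups pay the decode cost (the dictionary and rows are made up):

// Global dictionary for c3 (illustrative)
val dictionary = Map(1 -> "boston", 2 -> "paris", 3 -> "tokyo")

// (encoded c3, c2) rows that survived the filter c1 > 10
val rows = Seq((1, 10L), (2, 5L), (1, 7L), (3, 2L))

// Partial + final aggregation group on Int codes, never touching strings...
val aggregated: Map[Int, Long] =
  rows.groupBy(_._1).map { case (code, grp) => code -> grp.map(_._2).sum }

// ...and the Dictionary Decode step runs once per result group at the end.
val decoded: Map[String, Long] = aggregated.map { case (code, sum) => dictionary(code) -> sum }
// Map(boston -> 17, paris -> 5, tokyo -> 2)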
Usage: Write
• Using SQL:
CREATE TABLE tablename (name String, PhoneNumber String) STORED BY 'carbondata'
LOAD DATA [LOCAL] INPATH 'folder path' [OVERWRITE] INTO TABLE tablename
OPTIONS(property_name=property_value, ...)
INSERT INTO TABLE tablename select_statement1 FROM table1;
• Using DataFrame:
df.write
.format("carbondata")
.option("tableName", "t1")
.mode(SaveMode.Overwrite)
.save()
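Putting the DataFrame path together, a minimal end-to-end write might look like this, assuming a Spark build with the CarbonData jars on the classpath; the data and table name are illustrative:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("carbon-write").getOrCreate()
import spark.implicits._

// Toy input; in practice this comes from LOAD DATA or an upstream job.
val df = Seq(("alice", "555-0100"), ("bob", "555-0101")).toDF("name", "PhoneNumber")

df.write
  .format("carbondata")        // resolves only if CarbonData is on the classpath
  .option("tableName", "t1")
  .mode(SaveMode.Overwrite)
  .save()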
Usage: Read
• Using SQL:
SELECT project_list FROM t1
WHERE cond_list
GROUP BY columns
ORDER BY columns
• Using DataFrame:
val df = sparkSession.read
.format("carbondata")
.option("tableName", "t1")
.load("path_to_carbon_file")
df.select(…).show
Usage: Update and Delete
Modify two columns in table1 with values from table2:
UPDATE table1 A
SET (A.PRODUCT, A.REVENUE) =
(
SELECT PRODUCT, REVENUE
FROM table2 B
WHERE B.CITY = A.CITY AND B.BROKER = A.BROKER
)
WHERE A.DATE BETWEEN '2017-01-01' AND '2017-01-31'
Modify one column in table1:
UPDATE table1 A
SET A.REVENUE = A.REVENUE - 10
WHERE A.PRODUCT = 'phone'
Delete records in table1:
DELETE FROM table1 A
WHERE A.CUSTOMERID = '123'
(Slide illustration: the 'phone' revenues drop from 70 to 60 and from 30 to 20; the customer row '123,abc' is removed.)
Performance Results
Tests:
1. TPC-H benchmark (500 GB)
2. Test on production data set (billions of rows)
3. Test on large data set for scalability (1000B rows, 103 TB)
• Storage
– Parquet: partitioned by time column (c1)
– CarbonData: multi-dimensional index on c1~c10
• Compute
– Spark 2.1
TPC-H: Query (big scan)
For big scan queries: similar performance, within ±20% (chart: response time in seconds, Parquet vs. CarbonData, across TPC-H queries).
TPC-H: Query (filter on fact table)
Response time (sec):
Query      Parquet   CarbonData
Query_1    138       73
Query_3    283       207
Query_12   173       41
Query_16   1910      82
For queries that include a filter on the fact table, CarbonData gets 1.5-20X performance by leveraging the index.
TPC-H: Loading and Compression
• Loading throughput (MB/sec/node): Parquet 83.27, CarbonData 44.30
• Compression ratio: Parquet 2.96, CarbonData 2.74
Test on Production Data Set: Filter Query
Each query filters on some of columns c1~c10 (the filtered columns are marked on the slide):
Query   Response Time (sec)        Number of Tasks
        Parquet    CarbonData      Parquet    CarbonData
Q1      6.4        1.3             55         5
Q2      65         1.3             804        5
Q3      71         5.2             804        9
Q5      64         4.7             804        9
Q4      67         2.7             804        161
Q6      62         3.7             804        161
Q7      63         21.85           804        588
Q8      69         11.2            804        645
Observation: fewer scan tasks (less resource) are needed because of more efficient filtering leveraging the multi-level index.
Test on Production Data Set: Aggregation Query
No filter, GROUP BY c5 (a dictionary-encoded column). (Charts: total response time and per-stage execution time in seconds, Parquet vs. CarbonData.)
Observation: both partial aggregation and final aggregation are faster, because aggregation operates on dictionary-encoded values.
Test on Large Data Set
• Data: 200 to 1000 billion rows (half a year of telecom data from one Chinese province)
• Cluster: 70 nodes, 1120 cores
• Queries:
• Q1: filter (c1~c4), select *
• Q2: filter (c10), select *
• Q3: full scan aggregate
(Chart: response time as a percentage of the 200B-row baseline, from 200B to 1000B rows, for Q1-Q3.)
Observation, as data volume increases:
• The index effectively bounds response time for IO-bound queries (Q1, Q2)
• Spark scales linearly for the CPU-bound query (Q3)
What's Coming Next
• Enhancements to data loading & compression
• Streaming ingest:
• Introduce a row-based format for fast ingestion
• Gradually compact row-based data into column-based for analytic workloads
• Optimizations for time-series data
• Broader integration across the big data ecosystem: Beam, Flink, Kafka, Kylin
Apache CarbonData
• We'd love feedback, try-outs, and contributions of any kind!
• Code: https://github.com/apache/incubator-carbondata
• JIRA: https://issues.apache.org/jira/browse/CARBONDATA
• Dev mailing list: dev@carbondata.incubator.apache.org
• Website: http://carbondata.incubator.apache.org
• Current contributors: 64
• Monthly resolved JIRA issues: 100+
Thank You.
Jihong.Ma@huawei.com
Jacky.likun@huawei.com