DBA Level 400
About me
- An independent SQL consultant.
- A user of SQL Server from version 2000 onwards, with 12+ years of experience.
- I have a passion for understanding how the database engine works at a deep level.
A Brief History Of Parallelism In SQL Server
SQL Server version and the parallelism feature introduced:
- 7.0: Parallelism.
- 2000: Integrated parallel costing model; parallel index creation.
- 2005: Table partitioning, introducing a new form of partitioned source data.
- 2008: Partitioned table parallelism, with threads assigned to partitions in round robin fashion; star join optimisation; few outer rows optimisation.
- 2012: Batch mode.
- 2014: Parallel insert via SELECT INTO.
The Aim, To Push The Test Hardware To Its Limits
- ioDrive2 Duo 2.4 TB
- 32 GB triple-channel 1600 MHz DDR3
- 2 x SanDisk Extreme Pro 480 GB
- CPU: 2 x 6 cores at 2.4 GHz (Westmere)
Pushing The Parallel Insert To Its Limits, Two Parts
1st part: optimise the scan of the source table.
Obtaining An ETW Trace Stack Walking The Database Engine
1. Start the trace: xperf -on base -stackwalk profile
2. Run the SQL statement under test.
3. Stop and merge the trace: xperf -d stackwalk.etl
4. Analyse stackwalk.etl in WPA (Windows Performance Analyzer).
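The SQL statement itself is not reproduced on the slide. As a rough illustration of the kind of statement being traced, a SQL Server 2014 parallel SELECT ... INTO might look like the sketch below (table names, column names and the MAXDOP value are illustrative, not taken from the deck):

-- Hypothetical sketch only: SELECT ... INTO creates a heap, which is what
-- SQL Server 2014 can populate in parallel.
SELECT  l.LineItemKey,
        l.OrderKey,
        l.Quantity,
        l.ExtendedPrice
INTO    dbo.BigLineItemCopy         -- destination heap, created by the statement
FROM    dbo.BigLineItem AS l        -- large source heap being scanned
OPTION  (MAXDOP 18);                -- cap the degree of parallelism for the run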
Demonstration #1: Analysing IO Performance With The Windows Performance Toolkit
Basic Heap Scan IO Throughput and CPU Utilisation
Elapsed time 80 seconds
Where Is The Bottleneck ?
Optimising Serial Scan Performance: Hash Partition Source Table
- Each heap object has a single parallel page supplier, which acts like a card dealer.
- When multiple child threads need to access a single "card dealer", that access has to be serialised.
Tuning 101: if there is contention for a resource, create more of it.
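SQL Server has no native hash partitioning, so "create more page suppliers" is normally achieved by emulating a hash partition with a persisted computed bucket column over a range partition function. A sketch under that assumption, using 24 buckets to match the test and illustrative object names throughout:

-- Illustrative only: 23 boundary points give 24 partitions, i.e. 24
-- parallel page suppliers instead of one.
CREATE PARTITION FUNCTION pf_hash24 (tinyint)
AS RANGE LEFT FOR VALUES (0,1,2,3,4,5,6,7,8,9,10,11,
                          12,13,14,15,16,17,18,19,20,21,22);

CREATE PARTITION SCHEME ps_hash24
AS PARTITION pf_hash24 ALL TO ([PRIMARY]);     -- single file group for now

CREATE TABLE dbo.BigLineItemHashed
(
    LineItemKey   bigint NOT NULL,
    OrderKey      bigint,
    Quantity      int,
    ExtendedPrice money,
    HashBucket    AS CAST(ABS(LineItemKey) % 24 AS tinyint) PERSISTED   -- the "hash"
)
ON ps_hash24 (HashBucket);                     -- heap spread over 24 partitions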
Scan Of Hash Partitioned Heap (24 Partitions)
Elapsed time 104 seconds: PERFORMANCE HAS GONE BACKWARDS!!!
Scan Of Hash Partitioned Heap, Where Is The CPU Time Going ?
Why is COUNT interested in values ?
Scan Of Hash Partitioned Heap With NOT NULL Constraints
Elapsed time 56 seconds; IO throughput up by 1 to 1.5 GB/s and CPU consumption down!
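The NOT NULL change amounts to tightening the column definitions so that the aggregate no longer has to crack each value open to test for NULL. A sketch against the illustrative table above (the ALTER requires the existing data to contain no NULLs and itself scans the table):

-- Illustrative only: declare the scanned columns NOT NULL.
ALTER TABLE dbo.BigLineItemHashed ALTER COLUMN Quantity      int   NOT NULL;
ALTER TABLE dbo.BigLineItemHashed ALTER COLUMN ExtendedPrice money NOT NULL;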
The Old World Of Optimising Spinning Disk IO Throughput
- Encourage aggressive read-ahead behaviour.
- Minimise internal and external fragmentation.
- Use compression to bridge the gap between CPU and storage performance.
- Achieve balanced hardware configurations from spinning disk through to CPU core.
What IO Sizes Are We Getting At Present ?
ioDrive: mostly 64K reads. SanDisk SSD: 64K to 512K reads.
Which Striping Scheme Delivers The Best IO Throughput ?
- All partitions striped across the 7 files on the ioDrive and every file on the SSDs.
- Each partition's heap object stored in its own file group (sketched below).
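A sketch of the second scheme, one file group and one file per partition; the database name, paths and file sizes are purely illustrative, and in the real test the files were spread across the ioDrive and the SanDisk SSDs rather than a single volume:

-- Illustrative only: create 24 file groups, each with one file.
DECLARE @i int = 0, @fg sysname, @sql nvarchar(max);
WHILE @i < 24
BEGIN
    SET @fg  = N'FG_BUCKET_' + RIGHT(N'0' + CAST(@i AS nvarchar(2)), 2);
    SET @sql = N'ALTER DATABASE ScalingDemo ADD FILEGROUP ' + @fg + N'; '
             + N'ALTER DATABASE ScalingDemo ADD FILE (NAME = ' + @fg
             + N', FILENAME = ''D:\SQLData\' + @fg + N'.ndf'', SIZE = 8GB) '
             + N'TO FILEGROUP ' + @fg + N';';
    EXEC (@sql);
    SET @i += 1;
END;

-- Map one partition to each file group instead of ALL TO ([PRIMARY]).
CREATE PARTITION SCHEME ps_hash24_fg
AS PARTITION pf_hash24
TO (FG_BUCKET_00, FG_BUCKET_01, FG_BUCKET_02, FG_BUCKET_03, FG_BUCKET_04, FG_BUCKET_05,
    FG_BUCKET_06, FG_BUCKET_07, FG_BUCKET_08, FG_BUCKET_09, FG_BUCKET_10, FG_BUCKET_11,
    FG_BUCKET_12, FG_BUCKET_13, FG_BUCKET_14, FG_BUCKET_15, FG_BUCKET_16, FG_BUCKET_17,
    FG_BUCKET_18, FG_BUCKET_19, FG_BUCKET_20, FG_BUCKET_21, FG_BUCKET_22, FG_BUCKET_23);

The existing heap would then be rebuilt onto ps_hash24_fg (for example by creating and dropping a clustered index on the new scheme) so that each partition lands in its own file group.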
Does The New 'Striping' Scheme Make Much Difference ?
26 seconds, down from 56 seconds !!!
Have The IO Sizes Changed ?
More 512K reads on the SanDisk SSD.
"Comparative Analysis" of SSD Driver Stack vs Fusion-io Virtual Storage Layer
Putting Everything Together, How Well Does The Parallel Insert Scale ?
Elapsed and CPU time by degree of parallelism:

DOP    Elapsed time (ms)    CPU time (ms)
2      975072               1829999
4      636360               2338689
6      410178               2263438
8      335997               2381466
10     276996               2422876
12     279277               2665219
14     272656               2616095
16     268998               2734046
18     224117               2891687
20     230745               3012737
22     262889               2928376
24     253508               3020966

Would forcing the write IO size help improve performance ?
Using The E Startup Flag To Force 64 Extent Allocations At A Time
Elapsed and CPU time by degree of parallelism:

DOP    Elapsed time (ms)    CPU time (ms)
2      959284               1758547
4      624421               2298236
6      486307               2544436
8      340109               2300767
10     274852               2413845
12     281446               2644450
14     269782               2589904
16     269906               2955548
18     235444               2905218
20     225184               2969639
22     223902               3177187
24     248712               3048611
Wait Type Pct
SOS_SCHEDULER_YIELD 43
PAGEIOLATCH_SH 38
LATCH_EX 19
Should We Be Worried About The Spinlock Activity ?
- At 2.4 GHz, each of the 18 threads executes roughly 2,400,000,000 CPU cycles per second.
- 2,400,000,000 cycles x 18 threads x 224 seconds (the duration of the insert) = 9,676,800,000,000 CPU cycles.
- Against this we saw 141,328,383 spins for X_PACKET_LIST.
- The spins we are seeing are a drop in the ocean compared to the expended CPU cycles, so the answer is NO!
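The spin count quoted above can be captured by snapshotting sys.dm_os_spinlock_stats before and after the insert and diffing the two snapshots, for example:

-- Cumulative since instance start; snapshot before and after the run.
SELECT  name, collisions, spins, spins_per_collision, backoffs
FROM    sys.dm_os_spinlock_stats
WHERE   name = N'X_PACKET_LIST';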
Should We Be Worried About The Spinlock Activity ?
With a DOP of 18 and hyper-threading, the best case scenario is for 9 physical cores to be used:
9 (physical cores)
x 2,400,000,000 (CPU cycles in 1 second)
x 224 (duration of the insert, in seconds)
= 4,838,400,000,000 CPU cycles
versus 141,328,383 X_PACKET_LIST spins.
[Diagram: 24 hyper-threads spread across NUMA node 0 and NUMA node 1.]
The CPU cycles expended are 4 orders of magnitude greater than the X_PACKET_LIST spins, so the answer is NO!
E Start-up Flag Experiment Has Sent Performance Backwards !!!
- Baseline result with 48 files in the destination file group: 224,117 ms elapsed time at DOP 18.
- With the E flag the best result is 223,902 ms elapsed time at DOP 22, a negligible difference and only at a higher DOP.
What happens if we add more files to the destination file group ? . . .
Test Results With 96 Destination File Group Files And The E Flag
Elapsed and CPU time by degree of parallelism:

DOP    Elapsed time (ms)    CPU time (ms)
2      1072405              1834812
4      493269               1887734
6      476260               2577092
8      397712               2669795
10     278963               2384484
12     278660               2866142
14     224413               2527358
16     267579               2810190
18     272631               2943342
20     259809               2783533
22     266381               3126454
24     258877               3006719
Wait Type Pct
SOS_SCHEDULER_YIELD 57
LATCH_EX 25
PAGEIOLATCH_SH 9
PAGEIOLATCH_UP 5
ASYNCH_NETWORK_IO 4
What Is The NESTING_TRANSACTION_FULL Latch ?
- Controls access to the transaction descriptor structures (XDES).
- XDES is part of the runtime (sqlmin.dll) and is used to generate transaction logging information before it is copied to the log buffer.
- _FULL is for active transactions.
- A parallel query must start a sub-transaction for each thread; these transactions are sub-transactions of the parallel nested transaction.
- This information comes from the SQLskills blog.
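Cumulative waits on this latch class can be observed through sys.dm_os_latch_stats, which can be cleared between test runs:

-- Clear the counters, run the parallel insert, then inspect the latch class.
DBCC SQLPERF (N'sys.dm_os_latch_stats', CLEAR);

SELECT  latch_class, waiting_requests_count, wait_time_ms, max_wait_time_ms
FROM    sys.dm_os_latch_stats
WHERE   latch_class LIKE N'NESTING_TRANSACTION%'   -- _FULL and _READONLY
ORDER BY wait_time_ms DESC;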
The Problem With Wait Statistics
It is easy to see what is happening on the suspended queue.
The runnable queue only gives us the time accrued through SQL OS scheduler yields (signal waits).
That is the view provided by conventional tools; where we want greater insight is in what the schedulers are actually doing with the CPU time.
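To make the limitation concrete: sys.dm_os_wait_stats splits each wait into resource wait time (time on the suspended queue) and signal wait time (time on the runnable queue), and that is as far as its insight goes; it says nothing about where the CPU time on the schedulers is actually spent.

SELECT   wait_type,
         wait_time_ms - signal_wait_time_ms AS resource_wait_ms,   -- suspended queue
         signal_wait_time_ms                AS signal_wait_ms,     -- runnable queue
         waiting_tasks_count
FROM     sys.dm_os_wait_stats
WHERE    wait_time_ms > 0
ORDER BY wait_time_ms DESC;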
Layered Architecture Of The Database Engine From SQL 2012 Onwards
Language processing: SQLLANG.dll
Database engine runtime:
- SQLMIN.dll: storage engine and execution engine
- SQLTSES.dll: SQL expression service
- QDS.dll: query data store (SQL 2014+)
SQL OS: SQLDK.dll, SQLOS.dll
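The layering can be seen in the running instance: the DLLs above show up in sys.dm_os_loaded_modules, for example:

-- List the engine modules that make up the layers above.
SELECT  name, description, file_version
FROM    sys.dm_os_loaded_modules
WHERE   name LIKE N'%sqllang.dll'
   OR   name LIKE N'%sqlmin.dll'
   OR   name LIKE N'%sqltses.dll'
   OR   name LIKE N'%qds.dll'
   OR   name LIKE N'%sqldk.dll'
   OR   name LIKE N'%sqlos.dll';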
Demonstration #2: Stack Walking The Database Engine
Where Is Our CPU Time Going ?
Call Stack Weight
ntdll.dll!RtlUserThreadStart 3031180
... (SQL OS activity frames elided) ...
sqlmin.dll!CQScanUpdateNew::GetRow 2833949
sqlmin.dll!CQScanTableScanNew::GetRow 1514872
sqlmin.dll!CXRowset::FetchNextRow 1469327
sqlmin.dll!RowsetNewSS::FetchNextRow 1453029
sqlmin.dll!DatasetSession::GetNextRowValuesNoLock 1400805
sqlmin.dll!HeapDataSetSession::GetNextRowValuesInternal 1360920
sqlmin.dll!DataAccessWrapper::StoreColumnValue 477217
sqlmin.dll!DataAccessWrapper::DecompressColumnValue 344113
sqlmin.dll!DataAccessWrapper::DecompressColumnValue<itself> 188263
sqltses.dll!UnicodeCompressor::Decompress 135764
sqlmin.dll!__security_check_cookie<itself> 19698
sqlmin.dll!DataAccessWrapper::StoreColumnValue<itself> 132838
sqltses.dll!CEsExec::GeneralEval 1166920
sqlmin.dll!CValRow::SetDataX 986270
sqlmin.dll!RowsetBulk::InsertRow 975735
What Does The Call Stack Tell Us ?
- 32% of the CPU time is consumed by the insert part of the statement.
- There is a 4% CPU overhead when dealing with Unicode.
- With a 96-file destination file group and the E startup flag in use, inspecting column values still accounts for 46% of the total CPU time expended.
What Are We Waiting On ?*
* 96-file destination file group and the E flag in use
[Chart: percentage of wait time by wait type for DOP 2 to 24. Wait types plotted: SOS_SCHEDULER_YIELD, LATCH_EX, PAGEIOLATCH_SH, PAGEIOLATCH_EX, PAGEIOLATCH_UP, WRITELOG, ASYNCH_NETWORK_IO.]
What Have We Learned: Scan Rates
- A parallel heap scan is throttled by latching around access to the page range supplier. Solution: hash partition the table.
- There is a significant cost in "cracking open" columns to inspect their values during a scan: approximately 46% of CPU time.
- The practice of chasing the best throughput by obtaining 512K reads applies to flash storage with traditional interfaces (SATA / SAS), but less so to ioDrives.
What Have We Learned: Parallel Insert
- In the best case scenario, the parallel insert scales up to a DOP of 14.
- IO-related waits play a very minor part in overall wait time as the degree of parallelism is increased.
- The destination file group has a "number of files sweet spot", which relates to latency spikes (flushes to disk) on its member files.
- A combination of the destination file group file-count sweet spot and the E start-up flag yields the lowest elapsed time, at a DOP of 14.
Scaling sql server 2014 parallel insert
chris1adkin@yahoo.co.uk
http://uk.linkedin.com/in/wollatondba
ChrisAdkin8

What's hot (19)

PPTX
Sql sever engine batch mode and cpu architectures
Chris Adkin
Ā 
PDF
PostgreSQL and RAM usage
Alexey Bashtanov
Ā 
PDF
PostgreSQL WAL for DBAs
PGConf APAC
Ā 
PDF
Troubleshooting Complex Performance issues - Oracle SEG$ contention
Tanel Poder
Ā 
PPT
15 Ways to Kill Your Mysql Application Performance
guest9912e5
Ā 
PDF
Oracle in-Memory Column Store for BI
Franck Pachot
Ā 
PDF
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder
Ā 
PDF
Webinar slides: The Holy Grail Webinar: Become a MySQL DBA - Database Perform...
Severalnines
Ā 
PDF
In Memory Database In Action by Tanel Poder and Kerry Osborne
Enkitec
Ā 
PDF
Really Big Elephants: PostgreSQL DW
PostgreSQL Experts, Inc.
Ā 
PDF
Using Apache Spark and MySQL for Data Analysis
Sveta Smirnova
Ā 
PDF
Testing Delphix: easy data virtualization
Franck Pachot
Ā 
PDF
Dbvisit replicate: logical replication made easy
Franck Pachot
Ā 
PDF
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder
Ā 
PDF
Profiling the logwriter and database writer
Kyle Hailey
Ā 
PDF
Как PostgreSQL работает с Гиском
PostgreSQL-Consulting
Ā 
PDF
PostgreSQL 9.6 Performance-Scalability Improvements
PGConf APAC
Ā 
PPTX
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
BertrandDrouvot
Ā 
PDF
Managing terabytes: When PostgreSQL gets big
Selena Deckelmann
Ā 
Sql sever engine batch mode and cpu architectures
Chris Adkin
Ā 
PostgreSQL and RAM usage
Alexey Bashtanov
Ā 
PostgreSQL WAL for DBAs
PGConf APAC
Ā 
Troubleshooting Complex Performance issues - Oracle SEG$ contention
Tanel Poder
Ā 
15 Ways to Kill Your Mysql Application Performance
guest9912e5
Ā 
Oracle in-Memory Column Store for BI
Franck Pachot
Ā 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder
Ā 
Webinar slides: The Holy Grail Webinar: Become a MySQL DBA - Database Perform...
Severalnines
Ā 
In Memory Database In Action by Tanel Poder and Kerry Osborne
Enkitec
Ā 
Really Big Elephants: PostgreSQL DW
PostgreSQL Experts, Inc.
Ā 
Using Apache Spark and MySQL for Data Analysis
Sveta Smirnova
Ā 
Testing Delphix: easy data virtualization
Franck Pachot
Ā 
Dbvisit replicate: logical replication made easy
Franck Pachot
Ā 
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder
Ā 
Profiling the logwriter and database writer
Kyle Hailey
Ā 
Как PostgreSQL работает с Гиском
PostgreSQL-Consulting
Ā 
PostgreSQL 9.6 Performance-Scalability Improvements
PGConf APAC
Ā 
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
BertrandDrouvot
Ā 
Managing terabytes: When PostgreSQL gets big
Selena Deckelmann
Ā 

Similar to Scaling sql server 2014 parallel insert (20)

PPTX
SQL Server It Just Runs Faster
Bob Ward
Ā 
PPTX
Sql server 2016 it just runs faster sql bits 2017 edition
Bob Ward
Ā 
PPTX
Proving out flash storage array performance using swingbench and slob
Kapil Goyal
Ā 
PPT
download it from here
webhostingguy
Ā 
PPT
Troubleshooting SQL Server
Stephen Rose
Ā 
PPTX
Full scan frenzy at amadeus
MongoDB
Ā 
PDF
YOW2020 Linux Systems Performance
Brendan Gregg
Ā 
PPT
Best Practices for performance evaluation and diagnosis of Java Applications ...
IndicThreads
Ā 
PDF
Fundamentals of Physical Memory Analysis
Dmitry Vostokov
Ā 
PDF
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
In-Memory Computing Summit
Ā 
PDF
Accelerating EDA workloads on Azure – Best Practice and benchmark on Intel EM...
Meng-Ru (Raymond) Tsai
Ā 
PPTX
Webinar: Untethering Compute from Storage
Avere Systems
Ā 
PDF
Analyzing OS X Systems Performance with the USE Method
Brendan Gregg
Ā 
PPTX
IO Dubi Lebel
sqlserver.co.il
Ā 
PDF
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Fisnik Kraja
Ā 
PDF
Analyzing and Interpreting AWR
pasalapudi
Ā 
PDF
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Databricks
Ā 
PDF
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Danielle Womboldt
Ā 
PDF
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Community
Ā 
PDF
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Altinity Ltd
Ā 
SQL Server It Just Runs Faster
Bob Ward
Ā 
Sql server 2016 it just runs faster sql bits 2017 edition
Bob Ward
Ā 
Proving out flash storage array performance using swingbench and slob
Kapil Goyal
Ā 
download it from here
webhostingguy
Ā 
Troubleshooting SQL Server
Stephen Rose
Ā 
Full scan frenzy at amadeus
MongoDB
Ā 
YOW2020 Linux Systems Performance
Brendan Gregg
Ā 
Best Practices for performance evaluation and diagnosis of Java Applications ...
IndicThreads
Ā 
Fundamentals of Physical Memory Analysis
Dmitry Vostokov
Ā 
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
In-Memory Computing Summit
Ā 
Accelerating EDA workloads on Azure – Best Practice and benchmark on Intel EM...
Meng-Ru (Raymond) Tsai
Ā 
Webinar: Untethering Compute from Storage
Avere Systems
Ā 
Analyzing OS X Systems Performance with the USE Method
Brendan Gregg
Ā 
IO Dubi Lebel
sqlserver.co.il
Ā 
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Fisnik Kraja
Ā 
Analyzing and Interpreting AWR
pasalapudi
Ā 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Databricks
Ā 
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Danielle Womboldt
Ā 
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Community
Ā 
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Altinity Ltd
Ā 
Ad

More from Chris Adkin (9)

PDF
Bdc from bare metal to k8s
Chris Adkin
Ā 
PPTX
Data weekender deploying prod grade sql 2019 big data clusters
Chris Adkin
Ā 
PPTX
Data relay introduction to big data clusters
Chris Adkin
Ā 
PPTX
Ci with jenkins docker and mssql belgium
Chris Adkin
Ā 
PPTX
Continuous Integration With Jenkins Docker SQL Server
Chris Adkin
Ā 
PDF
TSQL Coding Guidelines
Chris Adkin
Ā 
PPT
J2EE Performance And Scalability Bp
Chris Adkin
Ā 
PPT
J2EE Batch Processing
Chris Adkin
Ā 
PPT
Oracle Sql Tuning
Chris Adkin
Ā 
Bdc from bare metal to k8s
Chris Adkin
Ā 
Data weekender deploying prod grade sql 2019 big data clusters
Chris Adkin
Ā 
Data relay introduction to big data clusters
Chris Adkin
Ā 
Ci with jenkins docker and mssql belgium
Chris Adkin
Ā 
Continuous Integration With Jenkins Docker SQL Server
Chris Adkin
Ā 
TSQL Coding Guidelines
Chris Adkin
Ā 
J2EE Performance And Scalability Bp
Chris Adkin
Ā 
J2EE Batch Processing
Chris Adkin
Ā 
Oracle Sql Tuning
Chris Adkin
Ā 
Ad

Recently uploaded (20)

PDF
apidays Singapore 2025 - How APIs can make - or break - trust in your AI by S...
apidays
Ā 
PPTX
apidays Helsinki & North 2025 - Running a Successful API Program: Best Practi...
apidays
Ā 
PPTX
What Is Data Integration and Transformation?
subhashenia
Ā 
PDF
OOPs with Java_unit2.pdf. sarthak bookkk
Sarthak964187
Ā 
PDF
The European Business Wallet: Why It Matters and How It Powers the EUDI Ecosy...
Lal Chandran
Ā 
PPTX
apidays Helsinki & North 2025 - From Chaos to Clarity: Designing (AI-Ready) A...
apidays
Ā 
PDF
The Best NVIDIA GPUs for LLM Inference in 2025.pdf
Tamanna36
Ā 
PPTX
apidays Helsinki & North 2025 - API access control strategies beyond JWT bear...
apidays
Ā 
PPTX
b6057ea5-8e8c-4415-90c0-ed8e9666ffcd.pptx
Anees487379
Ā 
PPTX
SHREYAS25 INTERN-I,II,III PPT (1).pptx pre
swapnilherage
Ā 
PPT
tuberculosiship-2106031cyyfuftufufufivifviviv
AkshaiRam
Ā 
PPTX
05_Jelle Baats_Tekst.pptx_AI_Barometer_Release_Event
FinTech Belgium
Ā 
PDF
NIS2 Compliance for MSPs: Roadmap, Benefits & Cybersecurity Trends (2025 Guide)
GRC Kompas
Ā 
PPTX
04_Tamás Marton_Intuitech .pptx_AI_Barometer_2025
FinTech Belgium
Ā 
PDF
apidays Singapore 2025 - Streaming Lakehouse with Kafka, Flink and Iceberg by...
apidays
Ā 
PDF
apidays Singapore 2025 - Trustworthy Generative AI: The Role of Observability...
apidays
Ā 
PPTX
Aict presentation on dpplppp sjdhfh.pptx
vabaso5932
Ā 
PPTX
apidays Singapore 2025 - Designing for Change, Julie Schiller (Google)
apidays
Ā 
PPTX
Listify-Intelligent-Voice-to-Catalog-Agent.pptx
nareshkottees
Ā 
PPTX
apidays Singapore 2025 - From Data to Insights: Building AI-Powered Data APIs...
apidays
Ā 
apidays Singapore 2025 - How APIs can make - or break - trust in your AI by S...
apidays
Ā 
apidays Helsinki & North 2025 - Running a Successful API Program: Best Practi...
apidays
Ā 
What Is Data Integration and Transformation?
subhashenia
Ā 
OOPs with Java_unit2.pdf. sarthak bookkk
Sarthak964187
Ā 
The European Business Wallet: Why It Matters and How It Powers the EUDI Ecosy...
Lal Chandran
Ā 
apidays Helsinki & North 2025 - From Chaos to Clarity: Designing (AI-Ready) A...
apidays
Ā 
The Best NVIDIA GPUs for LLM Inference in 2025.pdf
Tamanna36
Ā 
apidays Helsinki & North 2025 - API access control strategies beyond JWT bear...
apidays
Ā 
b6057ea5-8e8c-4415-90c0-ed8e9666ffcd.pptx
Anees487379
Ā 
SHREYAS25 INTERN-I,II,III PPT (1).pptx pre
swapnilherage
Ā 
tuberculosiship-2106031cyyfuftufufufivifviviv
AkshaiRam
Ā 
05_Jelle Baats_Tekst.pptx_AI_Barometer_Release_Event
FinTech Belgium
Ā 
NIS2 Compliance for MSPs: Roadmap, Benefits & Cybersecurity Trends (2025 Guide)
GRC Kompas
Ā 
04_Tamás Marton_Intuitech .pptx_AI_Barometer_2025
FinTech Belgium
Ā 
apidays Singapore 2025 - Streaming Lakehouse with Kafka, Flink and Iceberg by...
apidays
Ā 
apidays Singapore 2025 - Trustworthy Generative AI: The Role of Observability...
apidays
Ā 
Aict presentation on dpplppp sjdhfh.pptx
vabaso5932
Ā 
apidays Singapore 2025 - Designing for Change, Julie Schiller (Google)
apidays
Ā 
Listify-Intelligent-Voice-to-Catalog-Agent.pptx
nareshkottees
Ā 
apidays Singapore 2025 - From Data to Insights: Building AI-Powered Data APIs...
apidays
Ā 

Scaling sql server 2014 parallel insert

  • 2. About me ļ‚§An independent SQL Consultant ļ‚§A user of SQL Server from version 2000 onwards with 12+ years experience. ļ‚§I have a passion for understanding how the database engine works at a deep level.
  • 3. A Brief History Of Parallelism In SQL Server SQL Server Version Feature Introduced 7 Parallelism 2000 ļ‚§ Integrated parallel costing mode Parallel index creation 2005 Partitioning introduced a new form of partitioned source data. 2008 ļ‚§ Partition Table Parallelism, threads assigned to partitions in round robin fashion. ļ‚§ Star Join optimisation. ļ‚§ Few outer rows optimisation. 2012 Batch mode. 2014 Parallel insert via SELECT INTO
  • 4. The Aim, To Push The Test Hardware To Its Limits ioDrive2 DUO 2.4 Tb 32 Gb triple channel 1600 MHz DDR3 SanDisk Extreme Pro 480Gb x 2 CPU 2 x 6 core 2.4 Ghz (Westmere)
  • 5. Pushing The Parallel Insert To Its Limits, Two Parts 1 st part, optimise scan of source table
  • 6. Obtaining An ETW Trace Stack Walking The Database Engine xperf –on base –stackwalk profile xperf –d stackwalk.etl WPA SQL Statement
  • 7. Demonstration #1 Analysing IO Performance With Windows Performance Tool Kit
  • 8. Basic Heap Scan IO Throughput and CPU Utilisation Elapsed time 80 seconds
  • 9. Where Is The Bottleneck ?
  • 10. Optimising Serial Scan Performance: Hash Partition Source Table ļ‚§ Each heap object has a single parallel page supplier, it acts like a card dealer. ļ‚§ When multiple child threads need to access a single ā€œCard dealerā€, this has to be serialised. Tuning 101: If there is contention for a resource, create more of it.
  • 11. Scan Of Hash Partitioned Heap ( 24 Partitions ) Elapsed time 104 seconds PERFORMANCE HAS GONE BACKWARDS !!!!
  • 12. Scan Of Hash Partitioned Heap, Where Is The CPU Time Going ? Why is COUNT interested in values ?
  • 13. Scan Of Hash Partitioned Heap With NOT NULL Constraints Elapsed time 56s, IO throughput up by 1 ~ 1.5Gb/s, CPU consumption down !
  • 14. The Old World Of Optimising Spinning Disk IO Throughput ļ‚§ Encourage aggressive read ahead behaviour. ļ‚§ Minimise internal and external fragmentation. ļ‚§ Use compression to bridge gap between CPU and storage performance. ļ‚§ Achieve balanced hardware configurations from spinning disk to CPU core.
  • 15. What IO Sizes Are We Getting At Present ? IO Drive => mostly 64K SanDisk SSD => 64 ~ 512K
  • 16. Which Striping Scheme Delivers The Best IO Throughput ? All partitions striped access 7 files on the ioDrive to every file on the SSD Each partition heap object stored in its own file group
  • 17. Does The New ā€˜Striping’ Scheme Make Much Difference ? 26 seconds down from 56 seconds !!!
  • 18. Have The IO Sizes Changed ? more 512K reads on the SanDisk SSD
  • 19. ā€œComparative Analysisā€ of SSD Driver Stack Vs Fusion IO Virtual Storage Layer
  • 20. Putting Everything Together, How Well Does The Parallel Insert Scale ? 2 4 6 8 10 12 14 16 18 20 22 24 Elapsed Time (ms) 975072 636360 410178 335997 276996 279277 272656 268998 224117 230745 262889 253508 CPU (ms) 1829999 2338689 2263438 2381466 2422876 2665219 2616095 2734046 2891687 3012737 2928376 3020966 0 500000 1000000 1500000 2000000 2500000 3000000 3500000 0 200000 400000 600000 800000 1000000 1200000 CPUTime(ms) Elapsedtime(ms) Degree of Parallelism Elapsed and CPU time / DOP Elapsed Time (ms) CPU (ms) Would forcing the write IO size help improve performance ?
  • 21. Using The E Startup Flag To Force 64 Extent Allocations At A Time 2 4 6 8 10 12 14 16 18 20 22 24 Elapsed Time (ms) 959284 624421 486307 340109 274852 281446 269782 269906 235444 225184 223902 248712 CPU (ms) 1758547 2298236 2544436 2300767 2413845 2644450 2589904 2955548 2905218 2969639 3177187 3048611 0 500000 1000000 1500000 2000000 2500000 3000000 3500000 0 200000 400000 600000 800000 1000000 1200000 CPUTime(ms) Elapsedtime(ms) Degree of Parallelism Elapsed and CPU time / DOP Elapsed Time (ms) CPU (ms) Wait Type Pct SOS_SCHEDULER_YIELD 43 PAGEIOLATCH_SH 38 LATCH_EX 19
  • 22. Should We Be Worried About The Spin lock Activity ? 240,000,000 CPU cycles per second x 18 = 967,680,000,000 CPU cycles 141,328,383 spins for X_PACKET_LIST 224 seconds x Spins we are seeing are a drop in the ocean compared to expended CPU cycles answer is NO!
  • 23. Should We Be Worried About The Spin lock Activity ? With a DOP of 18 and hyper threading Best case scenario is for 9 cores to be used: 9 ( physical cores ) x 2400,000,000 ( CPU cycles in 1 second ) x 224 ( duration of the insert ) = 4,838,400,000,000 CPU cycles 141,328,383 X_PACKET_LIST spins Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread NUMA node 0 NUMA node 1 CPU cycles 4 orders of magnitude greater than X_PACKET_LIST spins answer is NO!
  • 24. E Start-up Flag Experiment Has Sent Performance Backwards !!! ļ‚§ Baseline result with 48 files in destination file group: ļ‚§ 224117 ms elapsed time at DOP 18 ļ‚§ With the E flag the best result is: ļ‚§ 223902 ms elapsed time at DOP 22  What happens if we add more files to the destination file group ? . . .
  • 25. Test Results With 96 Destination File Group Files And The E Flag 2 4 6 8 10 12 14 16 18 20 22 24 Elapsed Time (ms) 1072405 493269 476260 397712 278963 278660 224413 267579 272631 259809 266381 258877 CPU (ms) 1834812 1887734 2577092 2669795 2384484 2866142 2527358 2810190 2943342 2783533 3126454 3006719 0 500000 1000000 1500000 2000000 2500000 3000000 3500000 0 200000 400000 600000 800000 1000000 1200000 CPUTime(ms) Elapsedtime(ms) Degree of Parallelism Elapsed and CPU time / DOP Elapsed Time (ms) CPU (ms) Wait Type Pct SOS_SCHEDULER_YIELD 57 LATCH_EX 25 PAGEIOLATCH_SH 9 PAGEIOLATCH_UP 5 ASYNCH_NETWORK_IO 4
  • 26. What Is The NESTING_TRANSACTION_FULL Latch ? ļ‚§ Controls access to the transaction description structures (XDES). ļ‚§ XDES is part of the run time ( sqlmin.dll ) and used to generate transaction logging information before it is copied to the log buffer. ļ‚§ _FULL is for active transactions. ļ‚§ A parallel query must start a sub-transaction for each thread, these transactions are sub-transactions of the parallel nested transaction. ļ‚§ This information comes from the SQL Skills blog.
  • 27. The Problem With Wait Statistics Easy to see what is happening in the suspended queue. The runnable queue only gives us time accrued by SQL OS scheduler yields. The view provided by conventional tools Where we want greater insight
  • 28. Layered Architecture Of The Database Engine From SQL 2012 Onwards Language Processing – SQLLANG.dll Database Engine Runtime SQLMIN.dll Storage engine and execution engine SQLTSES.dll SQL expression service QDS.dll Query data store (SQL 2014+) SQL OS SQLDK.dll SQLOS.dll
  • 30. Where Is Our CPU Time Going ? Call Stack Weight ntdll.dll!RtlUserThreadStart 3031180 . . SQL OS activity . . sqlmin.dll!CQScanUpdateNew::GetRow 2833949 sqlmin.dll!CQScanTableScanNew::GetRow 1514872 sqlmin.dll!CXRowset::FetchNextRow 1469327 sqlmin.dll!RowsetNewSS::FetchNextRow 1453029 sqlmin.dll!DatasetSession::GetNextRowValuesNoLock 1400805 sqlmin.dll!HeapDataSetSession::GetNextRowValuesInternal 1360920 sqlmin.dll!DataAccessWrapper::StoreColumnValue 477217 sqlmin.dll!DataAccessWrapper::DecompressColumnValue 344113 sqlmin.dll!DataAccessWrapper::DecompressColumnValue<itself> 188263 sqltses.dll!UnicodeCompressor::Decompress 135764 sqlmin.dll!__security_check_cookie<itself> 19698 sqlmin.dll!DataAccessWrapper::StoreColumnValue<itself> 132838 sqltses.dll!CEsExec::GeneralEval 1166920 sqlmin.dll!CValRow::SetDataX 986270 sqlmin.dll!RowsetBulk::InsertRow 975735
  • 31. What Does The Call Stack Tell Us ? ļ‚§ 32% of the CPU time is consumed by the insert part of the statement. ļ‚§ There is a 4% CPU overhead when dealing with unicode. ļ‚§ With a 96 file destination file group and the E startup flag in use, inspecting column values still accounts for 46% of the total CPU time expended.
  • 32. What Are We Waiting On ?* *96 file is dest. File group and E flag in use 42 4 41 4 4 5 5 7 5 10 43 3 4 5 3 7 4 9 9 5 7 6 2 3 5 5 15 18 25 4 33 50 39 18 95 94 88 75 77 65 57 89 58 38 45 78 0 10 20 30 40 50 60 70 80 90 100 2 4 6 8 10 12 14 16 18 20 22 24 PERCENTAGEWAITTIME DOP SOS_SCHEDULER_YIELD LATCH_EX PAGEIOLATCH_SH PAGEIOLATCH_EX PAGEIOLATCH_UP WRITELOG ASYNCH_NETWORK_IO
  • 33. What Have We Learned: Scan Rates ļ‚§ A parallel heap scan is throttled by latching around access to the page range supplier: ļ‚§ Solution: hash partition the table. ļ‚§ There is significant cost in ā€œCracking openā€ columns to inspect their values during a scan, 46% of CPU time approximately. ļ‚§ The practice of trying to get the best throughput by obtaining 512K reads applies to flash storage with traditional interfaces ( SATA / SAS ), but not so much ioDrives.
  • 34. What Have We Learned: Parallel Insert ļ‚§ In the best case scenario, the parallel insert is scalable up to a DOP of 14. ļ‚§ IO related waits play a very minor part in overall wait time as the degree of parallelism is increased. ļ‚§ The destination file group has a ā€œnumber of files sweet spotā€ which relates to latency spikes (flushes to disk) on its member files. ļ‚§ A combination of the destination file group file number sweet spot and the use of E start-up flag yields the lowest elapsed time at a DOP of 14.
Editor's Notes
#7: Xperf can provide deep insights into the database engine that other tools cannot; in this case we can walk the stack associated with query execution and observe the total CPU consumption, in milliseconds, up to any point in the stack.