DBA Level 400
About me
- An independent SQL consultant.
- A user of SQL Server from version 2000 onwards, with 12+ years of experience.
- I have a passion for understanding how the database engine works at a deep level.
A Brief History Of Parallelism In SQL Server
SQL Server version and the parallelism feature introduced:
- 7.0: Parallelism.
- 2000: Integrated parallel costing model; parallel index creation.
- 2005: Table partitioning, introducing a new form of partitioned source data.
- 2008: Partitioned table parallelism, with threads assigned to partitions in round robin fashion; star join optimisation; few outer rows optimisation.
- 2012: Batch mode.
- 2014: Parallel insert via SELECT INTO.
The Aim, To Push The Test Hardware To Its Limits
- ioDrive2 Duo 2.4 TB
- 32 GB triple-channel 1600 MHz DDR3
- 2 x SanDisk Extreme Pro 480 GB
- CPU: 2 x 6 cores at 2.4 GHz (Westmere)
Pushing The Parallel Insert To Its Limits, Two Parts
1st part: optimise the scan of the source table.
Obtaining An ETW Trace Stack Walking The Database Engine
1. Start the trace: xperf -on base -stackwalk profile
2. Run the SQL statement under test.
3. Stop and merge the trace: xperf -d stackwalk.etl
4. Analyse stackwalk.etl in WPA (Windows Performance Analyzer).
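The SQL statement itself is not reproduced on the slide. As a rough illustration of the kind of statement being traced, a SQL Server 2014 parallel SELECT ... INTO might look like the sketch below (table names, column names and the MAXDOP value are illustrative, not taken from the deck):

-- Hypothetical sketch only: SELECT ... INTO creates a heap, which is what
-- SQL Server 2014 can populate in parallel.
SELECT  l.LineItemKey,
        l.OrderKey,
        l.Quantity,
        l.ExtendedPrice
INTO    dbo.BigLineItemCopy         -- destination heap, created by the statement
FROM    dbo.BigLineItem AS l        -- large source heap being scanned
OPTION  (MAXDOP 18);                -- cap the degree of parallelism for the run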
Demonstration #1: Analysing IO Performance With The Windows Performance Toolkit
Basic Heap Scan IO Throughput and CPU Utilisation
Elapsed time 80 seconds
Where Is The Bottleneck ?
Optimising Serial Scan Performance: Hash Partition Source Table
- Each heap object has a single parallel page supplier, which acts like a card dealer.
- When multiple child threads need to access a single "card dealer", that access has to be serialised.
Tuning 101: if there is contention for a resource, create more of it.
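SQL Server has no native hash partitioning, so "create more page suppliers" is normally achieved by emulating a hash partition with a persisted computed bucket column over a range partition function. A sketch under that assumption, using 24 buckets to match the test and illustrative object names throughout:

-- Illustrative only: 23 boundary points give 24 partitions, i.e. 24
-- parallel page suppliers instead of one.
CREATE PARTITION FUNCTION pf_hash24 (tinyint)
AS RANGE LEFT FOR VALUES (0,1,2,3,4,5,6,7,8,9,10,11,
                          12,13,14,15,16,17,18,19,20,21,22);

CREATE PARTITION SCHEME ps_hash24
AS PARTITION pf_hash24 ALL TO ([PRIMARY]);     -- single file group for now

CREATE TABLE dbo.BigLineItemHashed
(
    LineItemKey   bigint NOT NULL,
    OrderKey      bigint,
    Quantity      int,
    ExtendedPrice money,
    HashBucket    AS CAST(ABS(LineItemKey) % 24 AS tinyint) PERSISTED   -- the "hash"
)
ON ps_hash24 (HashBucket);                     -- heap spread over 24 partitions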
Scan Of Hash Partitioned Heap (24 Partitions)
Elapsed time 104 seconds: PERFORMANCE HAS GONE BACKWARDS!!!
Scan Of Hash Partitioned Heap, Where Is The CPU Time Going ?
Why is COUNT interested in values ?
Scan Of Hash Partitioned Heap With NOT NULL Constraints
Elapsed time 56 seconds; IO throughput up by 1 to 1.5 GB/s and CPU consumption down!
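The NOT NULL change amounts to tightening the column definitions so that the aggregate no longer has to crack each value open to test for NULL. A sketch against the illustrative table above (the ALTER requires the existing data to contain no NULLs and itself scans the table):

-- Illustrative only: declare the scanned columns NOT NULL.
ALTER TABLE dbo.BigLineItemHashed ALTER COLUMN Quantity      int   NOT NULL;
ALTER TABLE dbo.BigLineItemHashed ALTER COLUMN ExtendedPrice money NOT NULL;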
The Old World Of Optimising Spinning Disk IO Throughput
- Encourage aggressive read-ahead behaviour.
- Minimise internal and external fragmentation.
- Use compression to bridge the gap between CPU and storage performance.
- Achieve balanced hardware configurations from spinning disk through to CPU core.
What IO Sizes Are We Getting At Present ?
ioDrive: mostly 64K reads. SanDisk SSD: 64K to 512K reads.
Which Striping Scheme Delivers The Best IO Throughput ?
- All partitions striped across the 7 files on the ioDrive and every file on the SSDs.
- Each partition's heap object stored in its own file group (sketched below).
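A sketch of the second scheme, one file group and one file per partition; the database name, paths and file sizes are purely illustrative, and in the real test the files were spread across the ioDrive and the SanDisk SSDs rather than a single volume:

-- Illustrative only: create 24 file groups, each with one file.
DECLARE @i int = 0, @fg sysname, @sql nvarchar(max);
WHILE @i < 24
BEGIN
    SET @fg  = N'FG_BUCKET_' + RIGHT(N'0' + CAST(@i AS nvarchar(2)), 2);
    SET @sql = N'ALTER DATABASE ScalingDemo ADD FILEGROUP ' + @fg + N'; '
             + N'ALTER DATABASE ScalingDemo ADD FILE (NAME = ' + @fg
             + N', FILENAME = ''D:\SQLData\' + @fg + N'.ndf'', SIZE = 8GB) '
             + N'TO FILEGROUP ' + @fg + N';';
    EXEC (@sql);
    SET @i += 1;
END;

-- Map one partition to each file group instead of ALL TO ([PRIMARY]).
CREATE PARTITION SCHEME ps_hash24_fg
AS PARTITION pf_hash24
TO (FG_BUCKET_00, FG_BUCKET_01, FG_BUCKET_02, FG_BUCKET_03, FG_BUCKET_04, FG_BUCKET_05,
    FG_BUCKET_06, FG_BUCKET_07, FG_BUCKET_08, FG_BUCKET_09, FG_BUCKET_10, FG_BUCKET_11,
    FG_BUCKET_12, FG_BUCKET_13, FG_BUCKET_14, FG_BUCKET_15, FG_BUCKET_16, FG_BUCKET_17,
    FG_BUCKET_18, FG_BUCKET_19, FG_BUCKET_20, FG_BUCKET_21, FG_BUCKET_22, FG_BUCKET_23);

The existing heap would then be rebuilt onto ps_hash24_fg (for example by creating and dropping a clustered index on the new scheme) so that each partition lands in its own file group.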
Does The New 'Striping' Scheme Make Much Difference ?
26 seconds, down from 56 seconds !!!
Have The IO Sizes Changed ?
More 512K reads on the SanDisk SSD.
"Comparative Analysis" of SSD Driver Stack vs Fusion-io Virtual Storage Layer
Putting Everything Together, How Well Does The Parallel Insert Scale ?
Elapsed and CPU time by degree of parallelism:

DOP    Elapsed time (ms)    CPU time (ms)
2      975072               1829999
4      636360               2338689
6      410178               2263438
8      335997               2381466
10     276996               2422876
12     279277               2665219
14     272656               2616095
16     268998               2734046
18     224117               2891687
20     230745               3012737
22     262889               2928376
24     253508               3020966

Would forcing the write IO size help improve performance ?
Using The E Startup Flag To Force 64 Extent Allocations At A Time
Elapsed and CPU time by degree of parallelism:

DOP    Elapsed time (ms)    CPU time (ms)
2      959284               1758547
4      624421               2298236
6      486307               2544436
8      340109               2300767
10     274852               2413845
12     281446               2644450
14     269782               2589904
16     269906               2955548
18     235444               2905218
20     225184               2969639
22     223902               3177187
24     248712               3048611
Wait Type Pct
SOS_SCHEDULER_YIELD 43
PAGEIOLATCH_SH 38
LATCH_EX 19
Should We Be Worried About The Spinlock Activity ?
- At 2.4 GHz, each of the 18 threads executes roughly 2,400,000,000 CPU cycles per second.
- 2,400,000,000 cycles x 18 threads x 224 seconds (the duration of the insert) = 9,676,800,000,000 CPU cycles.
- Against this we saw 141,328,383 spins for X_PACKET_LIST.
- The spins we are seeing are a drop in the ocean compared to the expended CPU cycles, so the answer is NO!
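The spin count quoted above can be captured by snapshotting sys.dm_os_spinlock_stats before and after the insert and diffing the two snapshots, for example:

-- Cumulative since instance start; snapshot before and after the run.
SELECT  name, collisions, spins, spins_per_collision, backoffs
FROM    sys.dm_os_spinlock_stats
WHERE   name = N'X_PACKET_LIST';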
Should We Be Worried About The Spinlock Activity ?
With a DOP of 18 and hyper-threading, the best case scenario is for 9 physical cores to be used:
9 (physical cores)
x 2,400,000,000 (CPU cycles in 1 second)
x 224 (duration of the insert, in seconds)
= 4,838,400,000,000 CPU cycles
versus 141,328,383 X_PACKET_LIST spins.
[Diagram: 24 hyper-threads spread across NUMA node 0 and NUMA node 1.]
The CPU cycles expended are 4 orders of magnitude greater than the X_PACKET_LIST spins, so the answer is NO!
E Start-up Flag Experiment Has Sent Performance Backwards !!!
- Baseline result with 48 files in the destination file group: 224,117 ms elapsed time at DOP 18.
- With the E flag the best result is 223,902 ms elapsed time at DOP 22, a negligible difference and only at a higher DOP.
What happens if we add more files to the destination file group ? . . .
Test Results With 96 Destination File Group Files And The E Flag
Elapsed and CPU time by degree of parallelism:

DOP    Elapsed time (ms)    CPU time (ms)
2      1072405              1834812
4      493269               1887734
6      476260               2577092
8      397712               2669795
10     278963               2384484
12     278660               2866142
14     224413               2527358
16     267579               2810190
18     272631               2943342
20     259809               2783533
22     266381               3126454
24     258877               3006719
Wait Type Pct
SOS_SCHEDULER_YIELD 57
LATCH_EX 25
PAGEIOLATCH_SH 9
PAGEIOLATCH_UP 5
ASYNCH_NETWORK_IO 4
What Is The NESTING_TRANSACTION_FULL Latch ?
- Controls access to the transaction descriptor structures (XDES).
- XDES is part of the runtime (sqlmin.dll) and is used to generate transaction logging information before it is copied to the log buffer.
- _FULL is for active transactions.
- A parallel query must start a sub-transaction for each thread; these transactions are sub-transactions of the parallel nested transaction.
- This information comes from the SQLskills blog.
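Cumulative waits on this latch class can be observed through sys.dm_os_latch_stats, which can be cleared between test runs:

-- Clear the counters, run the parallel insert, then inspect the latch class.
DBCC SQLPERF (N'sys.dm_os_latch_stats', CLEAR);

SELECT  latch_class, waiting_requests_count, wait_time_ms, max_wait_time_ms
FROM    sys.dm_os_latch_stats
WHERE   latch_class LIKE N'NESTING_TRANSACTION%'   -- _FULL and _READONLY
ORDER BY wait_time_ms DESC;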
The Problem With Wait Statistics
It is easy to see what is happening on the suspended queue.
The runnable queue only gives us the time accrued through SQL OS scheduler yields (signal waits).
That is the view provided by conventional tools; where we want greater insight is in what the schedulers are actually doing with the CPU time.
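To make the limitation concrete: sys.dm_os_wait_stats splits each wait into resource wait time (time on the suspended queue) and signal wait time (time on the runnable queue), and that is as far as its insight goes; it says nothing about where the CPU time on the schedulers is actually spent.

SELECT   wait_type,
         wait_time_ms - signal_wait_time_ms AS resource_wait_ms,   -- suspended queue
         signal_wait_time_ms                AS signal_wait_ms,     -- runnable queue
         waiting_tasks_count
FROM     sys.dm_os_wait_stats
WHERE    wait_time_ms > 0
ORDER BY wait_time_ms DESC;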
Layered Architecture Of The Database Engine From SQL 2012 Onwards
Language processing: SQLLANG.dll
Database engine runtime:
- SQLMIN.dll: storage engine and execution engine
- SQLTSES.dll: SQL expression service
- QDS.dll: query data store (SQL 2014+)
SQL OS: SQLDK.dll, SQLOS.dll
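The layering can be seen in the running instance: the DLLs above show up in sys.dm_os_loaded_modules, for example:

-- List the engine modules that make up the layers above.
SELECT  name, description, file_version
FROM    sys.dm_os_loaded_modules
WHERE   name LIKE N'%sqllang.dll'
   OR   name LIKE N'%sqlmin.dll'
   OR   name LIKE N'%sqltses.dll'
   OR   name LIKE N'%qds.dll'
   OR   name LIKE N'%sqldk.dll'
   OR   name LIKE N'%sqlos.dll';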
Demonstration #2: Stack Walking The Database Engine
Where Is Our CPU Time Going ?
Call Stack Weight
ntdll.dll!RtlUserThreadStart 3031180
... (SQL OS activity frames elided) ...
sqlmin.dll!CQScanUpdateNew::GetRow 2833949
sqlmin.dll!CQScanTableScanNew::GetRow 1514872
sqlmin.dll!CXRowset::FetchNextRow 1469327
sqlmin.dll!RowsetNewSS::FetchNextRow 1453029
sqlmin.dll!DatasetSession::GetNextRowValuesNoLock 1400805
sqlmin.dll!HeapDataSetSession::GetNextRowValuesInternal 1360920
sqlmin.dll!DataAccessWrapper::StoreColumnValue 477217
sqlmin.dll!DataAccessWrapper::DecompressColumnValue 344113
sqlmin.dll!DataAccessWrapper::DecompressColumnValue<itself> 188263
sqltses.dll!UnicodeCompressor::Decompress 135764
sqlmin.dll!__security_check_cookie<itself> 19698
sqlmin.dll!DataAccessWrapper::StoreColumnValue<itself> 132838
sqltses.dll!CEsExec::GeneralEval 1166920
sqlmin.dll!CValRow::SetDataX 986270
sqlmin.dll!RowsetBulk::InsertRow 975735
What Does The Call Stack Tell Us ?
- 32% of the CPU time is consumed by the insert part of the statement.
- There is a 4% CPU overhead when dealing with Unicode.
- With a 96-file destination file group and the E startup flag in use, inspecting column values still accounts for 46% of the total CPU time expended.
What Are We Waiting On ?*
* 96-file destination file group and the E flag in use
[Chart: percentage of wait time by wait type for DOP 2 to 24. Wait types plotted: SOS_SCHEDULER_YIELD, LATCH_EX, PAGEIOLATCH_SH, PAGEIOLATCH_EX, PAGEIOLATCH_UP, WRITELOG, ASYNCH_NETWORK_IO.]
What Have We Learned: Scan Rates
- A parallel heap scan is throttled by latching around access to the page range supplier. Solution: hash partition the table.
- There is a significant cost in "cracking open" columns to inspect their values during a scan: approximately 46% of CPU time.
- The practice of chasing the best throughput by obtaining 512K reads applies to flash storage with traditional interfaces (SATA / SAS), but less so to ioDrives.
What Have We Learned: Parallel Insert
- In the best case scenario, the parallel insert scales up to a DOP of 14.
- IO-related waits play a very minor part in overall wait time as the degree of parallelism is increased.
- The destination file group has a "number of files sweet spot", which relates to latency spikes (flushes to disk) on its member files.
- A combination of the destination file group file-count sweet spot and the E start-up flag yields the lowest elapsed time, at a DOP of 14.
Scaling sql server 2014 parallel insert
chris1adkin@yahoo.co.uk
http://uk.linkedin.com/in/wollatondba
ChrisAdkin8

What's hot (19)

PPTX
Sql sever engine batch mode and cpu architectures
Chris Adkin
Ā 
PDF
PostgreSQL and RAM usage
Alexey Bashtanov
Ā 
PDF
PostgreSQL WAL for DBAs
PGConf APAC
Ā 
PDF
Troubleshooting Complex Performance issues - Oracle SEG$ contention
Tanel Poder
Ā 
PPT
15 Ways to Kill Your Mysql Application Performance
guest9912e5
Ā 
PDF
Oracle in-Memory Column Store for BI
Franck Pachot
Ā 
PDF
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder
Ā 
PDF
Webinar slides: The Holy Grail Webinar: Become a MySQL DBA - Database Perform...
Severalnines
Ā 
PDF
In Memory Database In Action by Tanel Poder and Kerry Osborne
Enkitec
Ā 
PDF
Really Big Elephants: PostgreSQL DW
PostgreSQL Experts, Inc.
Ā 
PDF
Using Apache Spark and MySQL for Data Analysis
Sveta Smirnova
Ā 
PDF
Testing Delphix: easy data virtualization
Franck Pachot
Ā 
PDF
Dbvisit replicate: logical replication made easy
Franck Pachot
Ā 
PDF
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder
Ā 
PDF
Profiling the logwriter and database writer
Kyle Hailey
Ā 
PDF
Как PostgreSQL работает с Гиском
PostgreSQL-Consulting
Ā 
PDF
PostgreSQL 9.6 Performance-Scalability Improvements
PGConf APAC
Ā 
PPTX
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
BertrandDrouvot
Ā 
PDF
Managing terabytes: When PostgreSQL gets big
Selena Deckelmann
Ā 
Sql sever engine batch mode and cpu architectures
Chris Adkin
Ā 
PostgreSQL and RAM usage
Alexey Bashtanov
Ā 
PostgreSQL WAL for DBAs
PGConf APAC
Ā 
Troubleshooting Complex Performance issues - Oracle SEG$ contention
Tanel Poder
Ā 
15 Ways to Kill Your Mysql Application Performance
guest9912e5
Ā 
Oracle in-Memory Column Store for BI
Franck Pachot
Ā 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder
Ā 
Webinar slides: The Holy Grail Webinar: Become a MySQL DBA - Database Perform...
Severalnines
Ā 
In Memory Database In Action by Tanel Poder and Kerry Osborne
Enkitec
Ā 
Really Big Elephants: PostgreSQL DW
PostgreSQL Experts, Inc.
Ā 
Using Apache Spark and MySQL for Data Analysis
Sveta Smirnova
Ā 
Testing Delphix: easy data virtualization
Franck Pachot
Ā 
Dbvisit replicate: logical replication made easy
Franck Pachot
Ā 
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder
Ā 
Profiling the logwriter and database writer
Kyle Hailey
Ā 
Как PostgreSQL работает с Гиском
PostgreSQL-Consulting
Ā 
PostgreSQL 9.6 Performance-Scalability Improvements
PGConf APAC
Ā 
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
BertrandDrouvot
Ā 
Managing terabytes: When PostgreSQL gets big
Selena Deckelmann
Ā 

Similar to Scaling sql server 2014 parallel insert (20)

PPTX
SQL Server It Just Runs Faster
Bob Ward
Ā 
PPTX
Sql server 2016 it just runs faster sql bits 2017 edition
Bob Ward
Ā 
PPTX
Proving out flash storage array performance using swingbench and slob
Kapil Goyal
Ā 
PPT
download it from here
webhostingguy
Ā 
PPT
Troubleshooting SQL Server
Stephen Rose
Ā 
PPTX
Full scan frenzy at amadeus
MongoDB
Ā 
PDF
YOW2020 Linux Systems Performance
Brendan Gregg
Ā 
PPT
Best Practices for performance evaluation and diagnosis of Java Applications ...
IndicThreads
Ā 
PDF
Fundamentals of Physical Memory Analysis
Dmitry Vostokov
Ā 
PDF
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
In-Memory Computing Summit
Ā 
PDF
Accelerating EDA workloads on Azure – Best Practice and benchmark on Intel EM...
Meng-Ru (Raymond) Tsai
Ā 
PPTX
Webinar: Untethering Compute from Storage
Avere Systems
Ā 
PDF
Analyzing OS X Systems Performance with the USE Method
Brendan Gregg
Ā 
PPTX
IO Dubi Lebel
sqlserver.co.il
Ā 
PDF
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Fisnik Kraja
Ā 
PDF
Analyzing and Interpreting AWR
pasalapudi
Ā 
PDF
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Databricks
Ā 
PDF
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Danielle Womboldt
Ā 
PDF
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Community
Ā 
PDF
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Altinity Ltd
Ā 
SQL Server It Just Runs Faster
Bob Ward
Ā 
Sql server 2016 it just runs faster sql bits 2017 edition
Bob Ward
Ā 
Proving out flash storage array performance using swingbench and slob
Kapil Goyal
Ā 
download it from here
webhostingguy
Ā 
Troubleshooting SQL Server
Stephen Rose
Ā 
Full scan frenzy at amadeus
MongoDB
Ā 
YOW2020 Linux Systems Performance
Brendan Gregg
Ā 
Best Practices for performance evaluation and diagnosis of Java Applications ...
IndicThreads
Ā 
Fundamentals of Physical Memory Analysis
Dmitry Vostokov
Ā 
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
In-Memory Computing Summit
Ā 
Accelerating EDA workloads on Azure – Best Practice and benchmark on Intel EM...
Meng-Ru (Raymond) Tsai
Ā 
Webinar: Untethering Compute from Storage
Avere Systems
Ā 
Analyzing OS X Systems Performance with the USE Method
Brendan Gregg
Ā 
IO Dubi Lebel
sqlserver.co.il
Ā 
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Fisnik Kraja
Ā 
Analyzing and Interpreting AWR
pasalapudi
Ā 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Databricks
Ā 
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Danielle Womboldt
Ā 
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Community
Ā 
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Altinity Ltd
Ā 
Ad

More from Chris Adkin (9)

PDF
Bdc from bare metal to k8s
Chris Adkin
Ā 
PPTX
Data weekender deploying prod grade sql 2019 big data clusters
Chris Adkin
Ā 
PPTX
Data relay introduction to big data clusters
Chris Adkin
Ā 
PPTX
Ci with jenkins docker and mssql belgium
Chris Adkin
Ā 
PPTX
Continuous Integration With Jenkins Docker SQL Server
Chris Adkin
Ā 
PDF
TSQL Coding Guidelines
Chris Adkin
Ā 
PPT
J2EE Performance And Scalability Bp
Chris Adkin
Ā 
PPT
J2EE Batch Processing
Chris Adkin
Ā 
PPT
Oracle Sql Tuning
Chris Adkin
Ā 
Bdc from bare metal to k8s
Chris Adkin
Ā 
Data weekender deploying prod grade sql 2019 big data clusters
Chris Adkin
Ā 
Data relay introduction to big data clusters
Chris Adkin
Ā 
Ci with jenkins docker and mssql belgium
Chris Adkin
Ā 
Continuous Integration With Jenkins Docker SQL Server
Chris Adkin
Ā 
TSQL Coding Guidelines
Chris Adkin
Ā 
J2EE Performance And Scalability Bp
Chris Adkin
Ā 
J2EE Batch Processing
Chris Adkin
Ā 
Oracle Sql Tuning
Chris Adkin
Ā 
Ad

Recently uploaded (20)

PDF
apidays Singapore 2025 - How APIs can make - or break - trust in your AI by S...
apidays
Ā 
PPTX
apidays Helsinki & North 2025 - Running a Successful API Program: Best Practi...
apidays
Ā 
PPTX
What Is Data Integration and Transformation?
subhashenia
Ā 
PDF
OOPs with Java_unit2.pdf. sarthak bookkk
Sarthak964187
Ā 
PDF
The European Business Wallet: Why It Matters and How It Powers the EUDI Ecosy...
Lal Chandran
Ā 
PPTX
apidays Helsinki & North 2025 - From Chaos to Clarity: Designing (AI-Ready) A...
apidays
Ā 
PDF
The Best NVIDIA GPUs for LLM Inference in 2025.pdf
Tamanna36
Ā 
PPTX
apidays Helsinki & North 2025 - API access control strategies beyond JWT bear...
apidays
Ā 
PPTX
b6057ea5-8e8c-4415-90c0-ed8e9666ffcd.pptx
Anees487379
Ā 
PPTX
SHREYAS25 INTERN-I,II,III PPT (1).pptx pre
swapnilherage
Ā 
PPT
tuberculosiship-2106031cyyfuftufufufivifviviv
AkshaiRam
Ā 
PPTX
05_Jelle Baats_Tekst.pptx_AI_Barometer_Release_Event
FinTech Belgium
Ā 
PDF
NIS2 Compliance for MSPs: Roadmap, Benefits & Cybersecurity Trends (2025 Guide)
GRC Kompas
Ā 
PPTX
04_Tamás Marton_Intuitech .pptx_AI_Barometer_2025
FinTech Belgium
Ā 
PDF
apidays Singapore 2025 - Streaming Lakehouse with Kafka, Flink and Iceberg by...
apidays
Ā 
PDF
apidays Singapore 2025 - Trustworthy Generative AI: The Role of Observability...
apidays
Ā 
PPTX
Aict presentation on dpplppp sjdhfh.pptx
vabaso5932
Ā 
PPTX
apidays Singapore 2025 - Designing for Change, Julie Schiller (Google)
apidays
Ā 
PPTX
Listify-Intelligent-Voice-to-Catalog-Agent.pptx
nareshkottees
Ā 
PPTX
apidays Singapore 2025 - From Data to Insights: Building AI-Powered Data APIs...
apidays
Ā 
apidays Singapore 2025 - How APIs can make - or break - trust in your AI by S...
apidays
Ā 
apidays Helsinki & North 2025 - Running a Successful API Program: Best Practi...
apidays
Ā 
What Is Data Integration and Transformation?
subhashenia
Ā 
OOPs with Java_unit2.pdf. sarthak bookkk
Sarthak964187
Ā 
The European Business Wallet: Why It Matters and How It Powers the EUDI Ecosy...
Lal Chandran
Ā 
apidays Helsinki & North 2025 - From Chaos to Clarity: Designing (AI-Ready) A...
apidays
Ā 
The Best NVIDIA GPUs for LLM Inference in 2025.pdf
Tamanna36
Ā 
apidays Helsinki & North 2025 - API access control strategies beyond JWT bear...
apidays
Ā 
b6057ea5-8e8c-4415-90c0-ed8e9666ffcd.pptx
Anees487379
Ā 
SHREYAS25 INTERN-I,II,III PPT (1).pptx pre
swapnilherage
Ā 
tuberculosiship-2106031cyyfuftufufufivifviviv
AkshaiRam
Ā 
05_Jelle Baats_Tekst.pptx_AI_Barometer_Release_Event
FinTech Belgium
Ā 
NIS2 Compliance for MSPs: Roadmap, Benefits & Cybersecurity Trends (2025 Guide)
GRC Kompas
Ā 
04_Tamás Marton_Intuitech .pptx_AI_Barometer_2025
FinTech Belgium
Ā 
apidays Singapore 2025 - Streaming Lakehouse with Kafka, Flink and Iceberg by...
apidays
Ā 
apidays Singapore 2025 - Trustworthy Generative AI: The Role of Observability...
apidays
Ā 
Aict presentation on dpplppp sjdhfh.pptx
vabaso5932
Ā 
apidays Singapore 2025 - Designing for Change, Julie Schiller (Google)
apidays
Ā 
Listify-Intelligent-Voice-to-Catalog-Agent.pptx
nareshkottees
Ā 
apidays Singapore 2025 - From Data to Insights: Building AI-Powered Data APIs...
apidays
Ā 

Scaling sql server 2014 parallel insert

  • 2. About me ļ‚§An independent SQL Consultant ļ‚§A user of SQL Server from version 2000 onwards with 12+ years experience. ļ‚§I have a passion for understanding how the database engine works at a deep level.
  • 3. A Brief History Of Parallelism In SQL Server SQL Server Version Feature Introduced 7 Parallelism 2000 ļ‚§ Integrated parallel costing mode Parallel index creation 2005 Partitioning introduced a new form of partitioned source data. 2008 ļ‚§ Partition Table Parallelism, threads assigned to partitions in round robin fashion. ļ‚§ Star Join optimisation. ļ‚§ Few outer rows optimisation. 2012 Batch mode. 2014 Parallel insert via SELECT INTO
  • 4. The Aim, To Push The Test Hardware To Its Limits ioDrive2 DUO 2.4 Tb 32 Gb triple channel 1600 MHz DDR3 SanDisk Extreme Pro 480Gb x 2 CPU 2 x 6 core 2.4 Ghz (Westmere)
  • 5. Pushing The Parallel Insert To Its Limits, Two Parts 1 st part, optimise scan of source table
  • 6. Obtaining An ETW Trace Stack Walking The Database Engine xperf –on base –stackwalk profile xperf –d stackwalk.etl WPA SQL Statement
  • 7. Demonstration #1 Analysing IO Performance With Windows Performance Tool Kit
  • 8. Basic Heap Scan IO Throughput and CPU Utilisation Elapsed time 80 seconds
  • 9. Where Is The Bottleneck ?
  • 10. Optimising Serial Scan Performance: Hash Partition Source Table ļ‚§ Each heap object has a single parallel page supplier, it acts like a card dealer. ļ‚§ When multiple child threads need to access a single ā€œCard dealerā€, this has to be serialised. Tuning 101: If there is contention for a resource, create more of it.
  • 11. Scan Of Hash Partitioned Heap ( 24 Partitions ) Elapsed time 104 seconds PERFORMANCE HAS GONE BACKWARDS !!!!
  • 12. Scan Of Hash Partitioned Heap, Where Is The CPU Time Going ? Why is COUNT interested in values ?
  • 13. Scan Of Hash Partitioned Heap With NOT NULL Constraints Elapsed time 56s, IO throughput up by 1 ~ 1.5Gb/s, CPU consumption down !
  • 14. The Old World Of Optimising Spinning Disk IO Throughput ļ‚§ Encourage aggressive read ahead behaviour. ļ‚§ Minimise internal and external fragmentation. ļ‚§ Use compression to bridge gap between CPU and storage performance. ļ‚§ Achieve balanced hardware configurations from spinning disk to CPU core.
  • 15. What IO Sizes Are We Getting At Present ? IO Drive => mostly 64K SanDisk SSD => 64 ~ 512K
  • 16. Which Striping Scheme Delivers The Best IO Throughput ? All partitions striped access 7 files on the ioDrive to every file on the SSD Each partition heap object stored in its own file group
  • 17. Does The New ā€˜Striping’ Scheme Make Much Difference ? 26 seconds down from 56 seconds !!!
  • 18. Have The IO Sizes Changed ? more 512K reads on the SanDisk SSD
  • 19. ā€œComparative Analysisā€ of SSD Driver Stack Vs Fusion IO Virtual Storage Layer
  • 20. Putting Everything Together, How Well Does The Parallel Insert Scale ? 2 4 6 8 10 12 14 16 18 20 22 24 Elapsed Time (ms) 975072 636360 410178 335997 276996 279277 272656 268998 224117 230745 262889 253508 CPU (ms) 1829999 2338689 2263438 2381466 2422876 2665219 2616095 2734046 2891687 3012737 2928376 3020966 0 500000 1000000 1500000 2000000 2500000 3000000 3500000 0 200000 400000 600000 800000 1000000 1200000 CPUTime(ms) Elapsedtime(ms) Degree of Parallelism Elapsed and CPU time / DOP Elapsed Time (ms) CPU (ms) Would forcing the write IO size help improve performance ?
  • 21. Using The E Startup Flag To Force 64 Extent Allocations At A Time 2 4 6 8 10 12 14 16 18 20 22 24 Elapsed Time (ms) 959284 624421 486307 340109 274852 281446 269782 269906 235444 225184 223902 248712 CPU (ms) 1758547 2298236 2544436 2300767 2413845 2644450 2589904 2955548 2905218 2969639 3177187 3048611 0 500000 1000000 1500000 2000000 2500000 3000000 3500000 0 200000 400000 600000 800000 1000000 1200000 CPUTime(ms) Elapsedtime(ms) Degree of Parallelism Elapsed and CPU time / DOP Elapsed Time (ms) CPU (ms) Wait Type Pct SOS_SCHEDULER_YIELD 43 PAGEIOLATCH_SH 38 LATCH_EX 19
  • 22. Should We Be Worried About The Spin lock Activity ? 240,000,000 CPU cycles per second x 18 = 967,680,000,000 CPU cycles 141,328,383 spins for X_PACKET_LIST 224 seconds x Spins we are seeing are a drop in the ocean compared to expended CPU cycles answer is NO!
  • 23. Should We Be Worried About The Spin lock Activity ? With a DOP of 18 and hyper threading Best case scenario is for 9 cores to be used: 9 ( physical cores ) x 2400,000,000 ( CPU cycles in 1 second ) x 224 ( duration of the insert ) = 4,838,400,000,000 CPU cycles 141,328,383 X_PACKET_LIST spins Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread Hyper-thread NUMA node 0 NUMA node 1 CPU cycles 4 orders of magnitude greater than X_PACKET_LIST spins answer is NO!
  • 24. E Start-up Flag Experiment Has Sent Performance Backwards !!! ļ‚§ Baseline result with 48 files in destination file group: ļ‚§ 224117 ms elapsed time at DOP 18 ļ‚§ With the E flag the best result is: ļ‚§ 223902 ms elapsed time at DOP 22  What happens if we add more files to the destination file group ? . . .
  • 25. Test Results With 96 Destination File Group Files And The E Flag 2 4 6 8 10 12 14 16 18 20 22 24 Elapsed Time (ms) 1072405 493269 476260 397712 278963 278660 224413 267579 272631 259809 266381 258877 CPU (ms) 1834812 1887734 2577092 2669795 2384484 2866142 2527358 2810190 2943342 2783533 3126454 3006719 0 500000 1000000 1500000 2000000 2500000 3000000 3500000 0 200000 400000 600000 800000 1000000 1200000 CPUTime(ms) Elapsedtime(ms) Degree of Parallelism Elapsed and CPU time / DOP Elapsed Time (ms) CPU (ms) Wait Type Pct SOS_SCHEDULER_YIELD 57 LATCH_EX 25 PAGEIOLATCH_SH 9 PAGEIOLATCH_UP 5 ASYNCH_NETWORK_IO 4
  • 26. What Is The NESTING_TRANSACTION_FULL Latch ? ļ‚§ Controls access to the transaction description structures (XDES). ļ‚§ XDES is part of the run time ( sqlmin.dll ) and used to generate transaction logging information before it is copied to the log buffer. ļ‚§ _FULL is for active transactions. ļ‚§ A parallel query must start a sub-transaction for each thread, these transactions are sub-transactions of the parallel nested transaction. ļ‚§ This information comes from the SQL Skills blog.
  • 27. The Problem With Wait Statistics Easy to see what is happening in the suspended queue. The runnable queue only gives us time accrued by SQL OS scheduler yields. The view provided by conventional tools Where we want greater insight
  • 28. Layered Architecture Of The Database Engine From SQL 2012 Onwards Language Processing – SQLLANG.dll Database Engine Runtime SQLMIN.dll Storage engine and execution engine SQLTSES.dll SQL expression service QDS.dll Query data store (SQL 2014+) SQL OS SQLDK.dll SQLOS.dll
  • 30. Where Is Our CPU Time Going ? Call Stack Weight ntdll.dll!RtlUserThreadStart 3031180 . . SQL OS activity . . sqlmin.dll!CQScanUpdateNew::GetRow 2833949 sqlmin.dll!CQScanTableScanNew::GetRow 1514872 sqlmin.dll!CXRowset::FetchNextRow 1469327 sqlmin.dll!RowsetNewSS::FetchNextRow 1453029 sqlmin.dll!DatasetSession::GetNextRowValuesNoLock 1400805 sqlmin.dll!HeapDataSetSession::GetNextRowValuesInternal 1360920 sqlmin.dll!DataAccessWrapper::StoreColumnValue 477217 sqlmin.dll!DataAccessWrapper::DecompressColumnValue 344113 sqlmin.dll!DataAccessWrapper::DecompressColumnValue<itself> 188263 sqltses.dll!UnicodeCompressor::Decompress 135764 sqlmin.dll!__security_check_cookie<itself> 19698 sqlmin.dll!DataAccessWrapper::StoreColumnValue<itself> 132838 sqltses.dll!CEsExec::GeneralEval 1166920 sqlmin.dll!CValRow::SetDataX 986270 sqlmin.dll!RowsetBulk::InsertRow 975735
  • 31. What Does The Call Stack Tell Us ? ļ‚§ 32% of the CPU time is consumed by the insert part of the statement. ļ‚§ There is a 4% CPU overhead when dealing with unicode. ļ‚§ With a 96 file destination file group and the E startup flag in use, inspecting column values still accounts for 46% of the total CPU time expended.
  • 32. What Are We Waiting On ?* *96 file is dest. File group and E flag in use 42 4 41 4 4 5 5 7 5 10 43 3 4 5 3 7 4 9 9 5 7 6 2 3 5 5 15 18 25 4 33 50 39 18 95 94 88 75 77 65 57 89 58 38 45 78 0 10 20 30 40 50 60 70 80 90 100 2 4 6 8 10 12 14 16 18 20 22 24 PERCENTAGEWAITTIME DOP SOS_SCHEDULER_YIELD LATCH_EX PAGEIOLATCH_SH PAGEIOLATCH_EX PAGEIOLATCH_UP WRITELOG ASYNCH_NETWORK_IO
  • 33. What Have We Learned: Scan Rates ļ‚§ A parallel heap scan is throttled by latching around access to the page range supplier: ļ‚§ Solution: hash partition the table. ļ‚§ There is significant cost in ā€œCracking openā€ columns to inspect their values during a scan, 46% of CPU time approximately. ļ‚§ The practice of trying to get the best throughput by obtaining 512K reads applies to flash storage with traditional interfaces ( SATA / SAS ), but not so much ioDrives.
  • 34. What Have We Learned: Parallel Insert ļ‚§ In the best case scenario, the parallel insert is scalable up to a DOP of 14. ļ‚§ IO related waits play a very minor part in overall wait time as the degree of parallelism is increased. ļ‚§ The destination file group has a ā€œnumber of files sweet spotā€ which relates to latency spikes (flushes to disk) on its member files. ļ‚§ A combination of the destination file group file number sweet spot and the use of E start-up flag yields the lowest elapsed time at a DOP of 14.
Editor's Notes
#7: Xperf can provide deep insights into the database engine that other tools cannot; in this case we can walk the stack associated with query execution and observe the total CPU consumption, in milliseconds, up to any point in the stack.