Capacity Model of an ETL system
Ashok Bhatla
Email – ASHOK.BHATLA.WRITER@GMAIL.COM
What is Business Intelligence?
Business Intelligence (BI) is a combination of tools, processes and
software that helps a company transform data into actionable
knowledge, so that it can make faster, better-informed decisions in
pursuit of its strategic goals.
It is all about providing the right information to management at the right
time at the lowest possible cost.

Because we are drowning in data but starving for knowledge,
Business Intelligence has become a top priority for IT managers today.
What is ETL?
ETL stands for Extract, Transform and Load. A transactional system is meant
to be a high-performance system so that users can get their work done faster.
Running reports directly against a transactional system slows it down. ETL
gained popularity as a way to move that reporting workload onto a separate system.

In computing, Extract, Transform, and Load (ETL) refers to a database
process that involves the following steps:
• Extract data from outside sources.
• Transform it to fit operational needs, which can include joining or
reformatting tables.
• Load it into the end target (a database; more specifically, an operational
data store, data mart, or data warehouse).
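The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any particular ETL tool: the `sales` source table, the `sales_by_region` target table and all function names are hypothetical, and in-memory SQLite databases stand in for both the transactional source and the reporting target.

```python
import sqlite3


def extract(source):
    """Extract: read raw rows from the (hypothetical) transactional 'sales' table."""
    return source.execute("SELECT region, amount_cents FROM sales").fetchall()


def transform(rows):
    """Transform: normalise region names and aggregate to one total per region."""
    totals = {}
    for region, amount_cents in rows:
        key = region.strip().upper()
        totals[key] = totals.get(key, 0) + amount_cents
    # Convert cents to currency units only after summing the integer amounts.
    return sorted((region, cents / 100) for region, cents in totals.items())


def load(target, rows):
    """Load: write the aggregated rows into the (hypothetical) reporting table."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_region "
        "(region TEXT PRIMARY KEY, total REAL)"
    )
    target.executemany("INSERT OR REPLACE INTO sales_by_region VALUES (?, ?)", rows)
    target.commit()


def run_etl(source, target):
    """Run extract, transform and load in order; return the number of rows loaded."""
    rows = transform(extract(source))
    load(target, rows)
    return len(rows)


def make_demo_source():
    """Build a tiny in-memory source database for demonstration purposes."""
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE sales (region TEXT, amount_cents INTEGER)")
    source.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 1250), ("West ", 400), ("EAST", 750)],
    )
    source.commit()
    return source
```

Calling `run_etl(make_demo_source(), sqlite3.connect(":memory:"))` loads one row per region into the target. The reporting queries then run against the target, leaving the transactional source free for its users.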
Example of ETL
[Diagram: data from OLTP systems (cost accounting system, payroll data, sales
data, purchasing data) is extracted into staged data, processed by ETL (joins,
transforms, deletes, etc.) and loaded into the EDW / reporting data.]
What is Capacity Planning?
• Capacity planning is the process of identifying the current
computing needs of a business application and forecasting its
future computing needs based on the business plans.
• In other words, it determines what computing resources are needed to
meet an application's service level objectives over a period of time.
• In today's economic climate, business requirements can change
rapidly depending on an organization's strategy and goals.
• Properly managed capacity plans should therefore leave room for
unforeseen requirements.
• Capacity planning can be done casually, or with organized and
disciplined methodologies.
• The more data-driven the capacity planning is, the more accurate the
results.
Capacity Planning of an IT System
Capacity planning needs to ensure that all hardware (disks, memory, CPU
and network), software resources (user licenses) and facilities are
optimally used.

• Software: software licenses, no. of users
• Hardware: servers, storage, networking, CPU
• Facilities: data center space, power, cooling
Capacity Planning
• We cannot manage something which we cannot measure.
• If no corrective action is taken based on the measured data, then
capacity planning is of no use.

Proactive capacity planning helps to:
• Avoid downtime by reducing the number of incidents
• Achieve the performance objectives established by the business
• Reduce TCO for the ETL system
• Achieve optimal utilization of computing resources
Capacity Planning Steps
• Identify Service Level Objectives – know the requirements in business terms.
• Analyze Current Capacity – gather data about resource consumption, idle
times and peak usage.
• Plan for Future Capacity – know the future business needs and how the IT
systems will be able to handle the increased load.
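The "analyze current capacity" step can be illustrated with a small sketch that summarises a series of utilization samples. The function name, the 10% idle threshold and the 80% service-level ceiling are assumptions chosen for illustration only; real thresholds would come from the service level objectives identified in the first step.

```python
from statistics import mean


def summarize_utilization(samples, idle_below=10.0, slo_max=80.0):
    """Summarise hourly CPU utilization samples given in percent (0-100).

    idle_below and slo_max are hypothetical thresholds: an hour counts as idle
    below idle_below, and as breaching the objective above slo_max.
    """
    if not samples:
        raise ValueError("need at least one sample")
    peak = max(samples)
    return {
        "average": mean(samples),
        "peak": peak,
        "idle_hours": sum(1 for s in samples if s < idle_below),
        "hours_over_slo": sum(1 for s in samples if s > slo_max),
        # Headroom is measured against the busiest hour, not the average.
        "headroom": 100.0 - peak,
    }
```

Reporting peak and idle hours next to the average matters here because a comfortable average can hide both breached objectives during batch windows and hours of unused capacity.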
Strike a Balance
Moore's Law is often paraphrased as computing getting cheaper and faster
roughly every 18 to 24 months. But organizations cannot wait for the next
generation of technology to become available, as they need to take care of
business today.

As per Parkinson's Law, if you give users more resources, they will
find ways to use them. IT managers cannot keep giving users unlimited
resources.

[Diagram: balancing the supply of and demand for resources against
performance, utilization and cost.]
Capacity Challenges for ETL Systems
• ETL jobs are of different types (some full refresh, some delta refresh),
process varying amounts of data and are scheduled at different
frequencies. Therefore, there are always spikes and valleys of workload.

• In a transactional system, SQL queries are typically simple and do not
require parallelism. In an ETL system, on the other hand, very large
datasets are processed, and workloads are irregular and hard to predict.
This makes it difficult to predict resource requirements.

• An enterprise ETL system processes thousands of batch jobs on a daily
basis. These systems connect to a large number of data sources, which
reside on different platforms and may be on different networks across
the WAN.

• Different types of users have different peak usage requirements. They
have different needs for transaction times, elapsed times and response
times.
• Disk capacity issues – engineers spend a lot of time cleaning up old,
stale data.
• Over-capacity – extra compute capacity is paid for but not utilized.
• Network slowness – batch jobs sometimes run slowly.
• The number of user licenses is reaching its limit.
Analyse the Complete Picture
Business Terms
• User needs: transaction time, response time, elapsed time, throughput time
• Data usage patterns (type of SQL queries or ETL transformations)
• Data complexity (financial, marketing or factory data)
• Volume and frequency of data loads (no. of batch jobs and GB of data processed)
• User profile (simple user or advanced data miner)

Technical Terms
• Storage (SAN / NAS / local disks)
• Processing power (CPU, no. of cores)
• Network bandwidth (transfer rate, bytes Tx/Rx)
• Memory (physical, cache, swap)
Capacity Planning Tools
Vectors of measurement: availability, performance, throughput, utilization,
quality, efficiency.

• Simulation – accurate, but needs a lot of setup time.
• Testing – costly, as another environment similar to production is needed.
• Trending – can be done in Excel. Simple, but does not take non-linear
behavior into account.
• Analytical modeling – more advanced, faster and accurate.
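The trending approach listed above amounts to fitting a straight line through past measurements and extending it forward, which is what a spreadsheet trendline does. A minimal sketch in Python, assuming one storage measurement per period (for example GB used per work week); the function names are hypothetical.

```python
def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept), where slope is the growth per period.
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def periods_until_full(ys, capacity):
    """Periods after the last observation until the trend line reaches capacity.

    Returns None when the trend is flat or shrinking, because the line then
    never reaches the capacity limit.
    """
    slope, intercept = fit_line(ys)
    if slope <= 0:
        return None
    x_full = (capacity - intercept) / slope
    return max(0.0, x_full - (len(ys) - 1))
```

As the slide notes, a straight line ignores non-linear behavior, so a forecast like this is only a first approximation; a step change such as onboarding a new subject area would invalidate it.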
Data Collection
Columns tracked per period (work week or month):
• No. of subject areas
• No. of projects
• No. of ETL batch jobs
• Storage consumption
• CPU
• Network
• Disk I/O
• Tx/Rx bytes

How do we collect performance / capacity data?
• OS monitoring tools, including free or built-in ones such as Nagios, kSar,
SQLMon and PerfMon
• Data collected in SQL tables
• Data collected by the software used by the storage frames, which gives
utilization, capacity and performance data
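Collecting capacity data into SQL tables can be as simple as appending timestamped samples to one table. The sketch below uses only the Python standard library; the `capacity_samples` table, its columns and the function names are assumptions for illustration, and a real deployment would more likely take its readings from one of the monitoring tools listed above.

```python
import shutil
import sqlite3
import time

# Hypothetical schema: one row per metric reading.
SCHEMA = """
CREATE TABLE IF NOT EXISTS capacity_samples (
    collected_at REAL NOT NULL,
    metric TEXT NOT NULL,
    value REAL NOT NULL
)
"""


def open_store(path=":memory:"):
    """Open (or create) the sample store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def disk_used_percent(path="."):
    """Percentage of the file system holding `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def record_sample(conn, metric, value, collected_at=None):
    """Append one reading; the timestamp defaults to the current time."""
    stamp = time.time() if collected_at is None else collected_at
    conn.execute("INSERT INTO capacity_samples VALUES (?, ?, ?)", (stamp, metric, value))
    conn.commit()


def latest(conn, metric):
    """Most recent value recorded for a metric, or None if there is none."""
    row = conn.execute(
        "SELECT value FROM capacity_samples WHERE metric = ? "
        "ORDER BY collected_at DESC LIMIT 1",
        (metric,),
    ).fetchone()
    return None if row is None else row[0]
```

A scheduled job could call `record_sample(conn, "disk_used_percent", disk_used_percent("/data"))` once per period (the `/data` path is a placeholder), building up the history that trending and dashboards need.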
Capacity Model for ETL System?
Examples of some metrics which can be developed:
o Average run time for a batch job
o Average CPU for a batch job
o CPU utilization / subject area / week
o CPU utilization / project / week
o No. of batch jobs / GB of storage
o No. of batch jobs / X amount of CPU
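A few of the metrics above can be derived directly from per-run job records. The sketch below is one possible shape, not a prescribed model: the `JobRun` record and its fields are hypothetical, and CPU seconds consumed are used as a stand-in for "CPU utilization" per subject area and week.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRun:
    """One execution of a batch job (hypothetical record layout)."""
    job: str
    subject_area: str
    week: str           # e.g. "WW01"
    run_seconds: float
    cpu_seconds: float


def average_by_job(runs, field):
    """Average of `field` ('run_seconds' or 'cpu_seconds') for each job name."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for run in runs:
        sums[run.job] += getattr(run, field)
        counts[run.job] += 1
    return {job: sums[job] / counts[job] for job in sums}


def cpu_by_subject_area_week(runs):
    """Total CPU seconds consumed per (subject area, week) pair."""
    totals = defaultdict(float)
    for run in runs:
        totals[(run.subject_area, run.week)] += run.cpu_seconds
    return dict(totals)


def jobs_per_gb(job_count, storage_gb):
    """Number of batch jobs per GB of storage consumed."""
    if storage_gb <= 0:
        raise ValueError("storage_gb must be positive")
    return job_count / storage_gb
```

Ratios such as jobs per GB are what make the model usable for planning: once they are stable, a forecast number of new batch jobs translates into an estimate of additional storage or CPU.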
Dashboard / Indicators
Phase I
Develop a trending model to begin with.

Dashboards can be developed using SharePoint BI if the capacity data is captured
in an Excel PivotTable or in SQL databases.

Phase II
Can we develop a predictive model?