SQL Server Integration Services and Analysis Services
Mohan

Who Am I?
Mohan Arumugam
Technologies Specialist
Consultant | Trainer | Blogger
E-mail : moohanan@gmail.com
Phone : +91 99406 53876
Agenda
• Overview of Integration Services
• Principles of Good Package Design
• Component Drilldown
• Performance Tuning
A DW architecture
[Diagram: sources (SQL/Oracle, SAP/Dynamics, legacy, text, XML) feed Integration Services, which loads the data warehouse (SQL Server, Oracle, DB2, Teradata); Analysis Services sits on top of the warehouse and serves reports, dashboards, scorecards, Excel, and BI tools.]
Session Objectives
 Assumptions
 Experience with SSIS and SSAS
 Goals
 Discuss design, performance, and scalability for building ETL packages and cubes (UDMs)
 Best practices
 Common mistakes
SQL Server 2005 BPA availability!
 BPA = Best Practice Analyzer
 Utility that scans your SQL Server metadata and recommends best practices
 Best practices from dev team and Customer Support Services
 What’s new:
 Support for SQL Server 2005
 Support for Analysis Services and Integration Services
 Scan scheduling
 Auto update framework
 CTP available now, RTM April
 http://www.microsoft.com/downloads/details.aspx?FamilyId=DA0531E4-E94C-4991-82FA-F0E3FBD05E63&displaylang=en
Agenda
 Integration Services
 Quick overview of IS
 Principles of Good Package Design
 Component Drilldown
 Performance Tuning
 Analysis Services
 UDM overview
 UDM design best practices
 Performance tips
What is SQL Server Integration Services?
 Introduced in SQL Server 2005
 The successor to Data Transformation Services
 The platform for a new generation of high-performance data integration technologies
ETL Objective: Before SSIS
[Diagram: call center data (semi-structured), legacy binary files, and an application database each pass through separate, hand-coded, staged steps (text mining, cleansing & ETL, repeated staging) before reaching the warehouse, which feeds reports, mobile data, data mining, and alerts & escalation.]
•Integration and warehousing require separate, staged operations.
•Preparation of data requires different, often incompatible, tools.
•Reporting and escalation is a slow process, delaying smart responses.
•Heavy data volumes make this scenario increasingly unworkable.
Changing the Game with SSIS
[Diagram: standard sources, custom sources, call center semi-structured data, legacy binary files, and an application database flow through a single SSIS data flow (text mining, data-cleansing, merge, and data mining components) into the warehouse, feeding reports, mobile data, and alerts & escalation.]
•Integration and warehousing are a seamless, manageable operation.
•Source, prepare, and load data in a single, auditable process.
•Reporting and escalation can be parallelized with the warehouse load.
•Scales to handle heavy and complex data requirements.
SSIS Architecture
 Control Flow (Runtime)
 A parallel workflow engine
 Executes containers and tasks
 Data Flow (“Pipeline”)
 A special runtime task
 A high-performance data pipeline
 Applies graphs of components to data movement
 Components can be sources, transformations, or destinations
 Highly parallel operations possible
Agenda
 Overview of Integration Services
 Principles of Good Package Design
 Component Drilldown
 Performance Tuning
Principles of Good Package Design - General
 Follow Microsoft Development Guidelines
 Iterative design, development & testing
 Understand the Business
 Understanding the people & processes is critical for success
 Kimball’s “Data Warehouse ETL Toolkit” book is an excellent reference
 Get the big picture
 Resource contention, processing windows, …
 SSIS does not forgive bad database design
 Old principles still apply – e.g. load with/without indexes?
 Platform considerations
 Will this run on IA64 / X64?
 No BIDS on IA64 – how will I debug?
 Is OLE-DB driver XXX available on IA64?
 Memory and resource usage on different platforms
Principles of Good Package Design - Architecture
 Process Modularity
 Break complex ETL into logically distinct packages (vs. monolithic design)
 Improves development & debug experience
 Package Modularity
 Separate sub-processes within package into separate Containers
 More elegant, easier to develop
 Simple to disable whole Containers when debugging
 Component Modularity
 Use Script Task/Transform for one-off problems
 Build custom components for maximum re-use
Bad Modularity
Good Modularity
Principles of Good Package Design - Infrastructure
 Use Package Configurations
 Build it in from the start
 Will make things easier later on
 Simplify deployment Dev → QA → Production
 Use Package Logging
 Performance & debugging
 Build in Security from the start
 Credentials and other sensitive info
 Package & Process IP
 Configurations & Parameters
Principles of Good Package Design - Development
 SSIS is visual programming!
 Use source code control system
 Undo is not as simple in a GUI environment!
 Improved experience for multi-developer environment
 Comment your packages and scripts
 In 2 weeks even you may forget a subtlety of your design
 Someone else has to maintain your code
 Use error-handling
 Use the correct precedence constraints on tasks
 Use the error outputs on transforms – store them in a table for processing later, or use downstream if
the error can be handled in the package
 Try…Catch in your scripts
Component Drilldown - Tasks & Transforms
 Avoid over-design
 Too many moving parts is inelegant and likely slow
 But don’t be afraid to experiment – there are many ways to solve a problem
 Maximize Parallelism
 Allocate enough threads
 EngineThreads property on DataFlow Task
 “Rule of thumb” - # of datasources + # of async components
 Minimize blocking
 Synchronous vs. Asynchronous components
 Memcopy is expensive – reduce the number of asynchronous components in a flow if
possible – example coming up later
 Minimize ancillary data
 For example, minimize data retrieved by LookupTx
Debugging & Performance Tuning - General
 Leverage the logging and auditing features
 MsgBox is your friend
 Pipeline debuggers are your friend
 Use the throughput component from Project REAL
 Experiment with different techniques
 Use source code control system
 Focus on the bottlenecks – methodology discussed later
 Test on different platforms
 32bit, IA64, x64
 Local Storage, SAN
 Memory considerations
 Network & topology considerations
Debugging & Performance Tuning - Volume
 Remove redundant columns
 Use SELECT statements as opposed to tables
 SELECT * is your enemy
 Also remove redundant columns after every async component!
 Filter rows
 WHERE clause is your friend
 Conditional Split in SSIS
 Concatenate or re-route unneeded columns
 Parallel loading
 Split source data into multiple chunks at the source system (see the sketch after this slide)
 Flat Files – multiple files
 Relational – via key fields and indexes
 Multiple Destination components all loading same table
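A minimal T-SQL sketch of the volume tips above; the table and column names (dbo.FactSales, SalesKey, SaleDate) are hypothetical, not from the deck:

-- Pull only the columns the data flow uses (never SELECT *),
-- and filter rows at the source with a WHERE clause.
SELECT SalesKey, CustomerKey, ProductKey, SaleDate, Amount
FROM dbo.FactSales
WHERE SaleDate >= '20060101' AND SaleDate < '20060201';

-- For parallel loading, give each source component a disjoint key range;
-- an index on SalesKey keeps each chunk a cheap range scan.
SELECT SalesKey, CustomerKey, ProductKey, SaleDate, Amount
FROM dbo.FactSales
WHERE SalesKey >= 1 AND SalesKey < 1000000;   -- chunk 1; repeat with the next range for chunk 2, etc.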
Debugging & Performance Tuning - Application
 Is BCP good enough?
 Overhead of starting up an SSIS package may offset any performance gain over BCP for small data sets
 Is the greater manageability and control of SSIS needed?
 Which pattern?
 Many Lookup patterns possible – which one is most suitable?
 See Project Real for examples of patterns:
http://www.microsoft.com/sql/solutions/bi/projectreal.mspx
 Which component?
 Bulk Import Task vs. Data Flow
 Bulk Import might give better performance if there are no transformations or filtering required, and the destination is SQL Server.
 Lookup vs. MergeJoin (LeftJoin) vs. set based statements in SQL
 MergeJoin might be required if you’re not able to populate the lookup cache.
 Set-based SQL statements might provide a way to persist lookup cache misses and apply a set-based operation for higher performance (see the sketch after this slide).
 Script vs. custom component
 Script might be good enough for small transforms that are typically not reused
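A minimal T-SQL sketch of the set-based alternative mentioned above; the staging and dimension tables (staging.Sales, dbo.DimCustomer) are illustrative assumptions:

-- Resolve dimension keys with one set-based join instead of row-by-row lookups.
UPDATE s
SET s.CustomerKey = d.CustomerKey
FROM staging.Sales AS s
JOIN dbo.DimCustomer AS d
  ON d.CustomerAltKey = s.CustomerAltKey;

-- Persist the lookup misses (rows with no matching dimension member) for later handling.
INSERT INTO staging.SalesMisses (CustomerAltKey, Amount)
SELECT CustomerAltKey, Amount
FROM staging.Sales
WHERE CustomerKey IS NULL;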
Case Study - Patterns
[Two lookup patterns compared: handling Lookup misses via the Error Output ran in 105 seconds; ignoring lookup errors and checking for null looked-up values in a Derived Column ran in 83 seconds.]
Debugging & Performance Tuning – A methodology
 Optimize and Stabilize the basics
 Minimize staging (else use RawFiles if possible)
 Make sure you have enough Memory
 Windows, Disk, Network, …
 SQL FileGroups, Indexing, Partitioning
 Get Baseline
 Replace destinations with RowCount
 Source->RowCount throughput
 Source->Destination throughput
 Incrementally add/change components to see effect
 This could include the DB layer
 Use source code control!
 Optimize slow components for resources available
Case Study - Parallelism
 Focus on critical path
 Utilize available resources
[Diagram captions: “Memory constrained”, “Reader and CPU constrained”, “Let it rip!”, “Optimize the slowest”.]
Summary
 Follow best practice development methods
 Understand how SSIS architecture influences performance
 Buffers, component types
 Design Patterns
 Learn the new features
 But do not forget the existing principles
 Use the native functionality
 But do not be afraid to extend
 Measure performance
 Focus on the bottlenecks
 Maximize parallelism and memory use where appropriate
 Be aware of different platforms’ capabilities (64-bit RAM)
 Testing is key
Analysis Services
Agenda
Server architecture and UDM basics
Optimizing the cube design
Partitioning and Aggregations
Processing
Queries and calculations
Conclusion
Client Server Architecture
[Diagram: client apps (BIDS, SSMS, Profiler, Excel) reach the Analysis Server through the OLEDB, ADOMD.NET, ADOMD, and AMO client libraries; requests travel as XMLA over TCP, or over HTTP via IIS.]
Dimension
An entity on which analysis is to be performed (e.g. Customers)
Consists of:
Attributes that describe the entity
Hierarchies that organize dimension members in meaningful ways

Customer ID | First Name | Last Name | State | City     | Marital Status | Gender | … | Age
123         | John       | Doe       | WA    | Seattle  | Married        | Male   | … | 42
456         | Lance      | Smith     | WA    | Redmond  | Unmarried      | Male   | … | 34
789         | Jill       | Thompson  | OR    | Portland | Married        | Female | … | 21
Attribute
 Containers of dimension members.
 Completely define the dimensional space.
 Enable slicing and grouping the dimensional space in interesting ways.
 Customers in state WA and age > 50
 Customers who are married and male
 Typically have one-to-many relationships
 City → State, State → Country, etc.
 All attributes implicitly related to the key
Hierarchy
Ordered collection of attributes into levels
Navigation path through dimensional space
User defined hierarchies – typically multiple levels
Attribute hierarchies – implicitly created for each attribute – single level
Customers by Geography: Country → State → City → Customer
Customers by Demographics: Marital → Gender → Customer
Dimension Model
[Diagram: the dimension’s attributes (Customer, City, State, Country, Gender, Marital, Age) alongside the hierarchies built from them: Country → State → City → Customer and Marital → Gender → Customer, plus the implicit single-level attribute hierarchies.]
Cube
Collection of dimensions and measures
Measure → numeric data associated with a set of dimensions (e.g. Qty Sold, Sales Amount, Cost)
Multi-dimensional space
Defined by dimensions and measures
E.g. (Customers, Products, Time, Measures)
Intersection of dimension members and measures is a cell: (USA, Bikes, 2004, Sales Amount) = $1,523,374.83
A Cube
[Diagram: a cube with Product (Peas, Corn, Bread, Milk, Beer), Market (Bos, NYC, Chi, Sea), and Time (Jan, Feb, Mar) axes; one highlighted cell represents units of beer sold in Boston in January.]
Measure Group
Group of measures with same dimensionality
Analogous to fact table
Cube can contain more than one measure group
E.g. Sales, Inventory, Finance
Multi-dimensional space
Subset of dimensions and measures in the cube
AS2000 comparison
Virtual Cube → Cube
Cube → Measure Group
            | Sales | Inventory | Finance
Customers   |   X   |           |
Products    |   X   |     X     |
Time        |   X   |     X     |    X
Promotions  |   X   |           |
Warehouse   |       |     X     |
Department  |       |           |    X
Account     |       |           |    X
Scenario    |       |           |    X
(Sales, Inventory, and Finance are each a Measure Group)
Agenda
 Server architecture and UDM Basics
 Optimizing the cube design
 Partitioning and Aggregations
 Processing
 Queries and calculations
 Conclusion
Top 3 Tenets of Good Cube Design
Attribute relationships
Attribute relationships
Attribute relationships
Attribute Relationships
One-to-many relationships between attributes
Server simply “works better” if you define them where applicable
Examples:
City → State, State → Country
Day → Month, Month → Quarter, Quarter → Year
Product Subcategory → Product Category
Rigid vs. flexible relationships (default is flexible)
Customer → City, Customer → PhoneNo are flexible
Customer → BirthDate, City → State are rigid
All attributes implicitly related to key attribute
Attribute Relationships (continued)
[Diagrams: attribute relationship trees for the Customer dimension; Gender, Marital, and Age relate directly to the Customer key, while City → State → Country forms a chain above Customer.]
Attribute Relationships
Where are they used?
MDX Semantics
Tells the formula engine how to roll up measure values
If the grain of the measure group is different from the key attribute (e.g. Sales
by Month)
Attribute relationships from grain to other attributes required (e.g. Month → Quarter, Quarter → Year)
Otherwise no data (NULL) returned for Quarter and Year
MDX Semantics explained in detail at:
http://www.sqlserveranalysisservices.com/OLAPPapers/AttributeRelationships.htm
Attribute Relationships
Where are they used?
Storage
Reduces redundant relationships between dimension members – normalizes
dimension storage
Enables clustering of records within partition segments (e.g. store facts for a
month together)
Processing
Reduces memory consumption in dimension processing – fewer hash tables to fit in memory
Allows large dimensions to push past the 32-bit barrier
Speeds up dimension and partition processing overall
Attribute Relationships
Where are they used?
Query performance
Dimension storage access is faster
Produces better execution plans
Aggregation design
Enables aggregation design algorithm to produce effective set of
aggregations
Dimension security
DeniedSet = {State.WA} should deny cities and customers in WA – requires
attribute relationships
Member properties
Attribute relationships identify member properties on levels
Attribute Relationships
How to set them up?
Creating an attribute relationship is easy, but
…
Pay careful attention to the key columns!
Make sure every attribute has unique key columns (add composite keys as
needed)
There must be a 1:M relation between the key
columns of the two attributes
Invalid key columns cause a member to have
multiple parents
Dimension processing picks one parent arbitrarily and succeeds
Hierarchy looks wrong!
Attribute Relationships
How to set them up?
Don’t forget to remove redundant
relationships!
All attributes start with relationship to key
Customer → City → State → Country
Customer → State (redundant)
Customer → Country (redundant)
Attribute Relationships
Example
Time dimension
Day, Week, Month, Quarter, Year
Year: 2003 to 2010
Quarter: 1 to 4
Month: 1 to 12
Week: 1 to 52
Day: 20030101 to 20101231
[Diagram: attribute relationship tree over Day, Week, Month, Quarter, Year]
Attribute Relationships
Example
Time dimension
Day, Week, Month, Quarter, Year
Year: 2003 to 2010
Quarter: 1 to 4 - Key columns (Year, Quarter)
Month: 1 to 12
Week: 1 to 52
Day: 20030101 to 20101231
[Diagram: attribute relationship tree over Day, Week, Month, Quarter, Year]
Attribute Relationships
Example
Time dimension
Day, Week, Month, Quarter, Year
Year: 2003 to 2010
Quarter: 1 to 4 - Key columns (Year, Quarter)
Month: 1 to 12 - Key columns (Year, Month)
Week: 1 to 52
Day: 20030101 to 20101231
[Diagram: attribute relationship tree over Day, Week, Month, Quarter, Year]
Attribute Relationships
Example
Time dimension
Day, Week, Month, Quarter, Year
Year: 2003 to 2010
Quarter: 1 to 4 - Key columns (Year, Quarter)
Month: 1 to 12 - Key columns (Year, Month)
Week: 1 to 52 - Key columns (Year, Week)
Day: 20030101 to 20101231
[Diagram: attribute relationship tree over Day, Week, Month, Quarter, Year]
Defining Attribute Relationships
User Defined Hierarchies
Pre-defined navigation paths through dimensional space defined by attributes
Attribute hierarchies enable ad hoc navigation
Why create user defined hierarchies then?
Guide end users to interesting navigation paths
Existing client tools are not “attribute aware”
Performance
Optimize navigation path at processing time
Materialization of hierarchy tree on disk
Aggregation designer favors user defined hierarchies
Natural Hierarchies
1:M relation (via attribute relationships)
between every pair of adjacent levels
Examples:
Country-State-City-Customer (natural)
Country-City (natural)
State-Customer (natural)
Age-Gender-Customer (unnatural)
Year-Quarter-Month (depends on key columns)
How many quarters and months?
4 & 12 across all years (unnatural)
4 & 12 for each year (natural)
Natural Hierarchies
Best Practice for Hierarchy Design
Performance implications
Only natural hierarchies are materialized on disk during processing
Unnatural hierarchies are built on the fly during queries (and cached in
memory)
Server internally decomposes unnatural hierarchies into natural components
Essentially operates like ad hoc navigation path (but somewhat better)
Create natural hierarchies where possible
Using attribute relationships
Not always appropriate (e.g. Age-Gender)
Best Practices for Cube Design
Dimensions
Consolidate multiple hierarchies into single dimension (unless they are
related via fact table)
Avoid ROLAP storage mode if performance is key
Use role playing dimensions (e.g. OrderDate, BillDate, ShipDate) - avoids
multiple physical copies
Use parent-child dimensions prudently
No intermediate level aggregation support
Use many-to-many dimensions prudently
Slower than regular dimensions, but faster than calculations
Intermediate measure group must be “small” relative to primary measure group
Best Practices for Cube Design
Attributes
Define all possible attribute relationships!
Mark attribute relationships as rigid where appropriate
Use integer (or numeric) key columns
Set AttributeHierarchyEnabled to false for attributes not used for navigation
(e.g. Phone#, Address)
Set AttributeHierarchyOptimizedState to NotOptimized for infrequently used
attributes
Set AttributeHierarchyOrdered to false if the order of members returned by
queries is not important
Hierarchies
Use natural hierarchies where possible
Best Practices for Cube Design
Measures
Use smallest numeric data type possible
Use semi-additive aggregate functions instead of MDX calculations to achieve
same behavior
Put distinct count measures into separate measure group (BIDS does this
automatically)
Avoid string source column for distinct count measures
Agenda
 Server architecture and UDM Basics
 Optimizing the cube design
 Partitioning and Aggregations
 Processing
 Queries and calculations
 Conclusion
Partitioning
Mechanism to break up large cube into
manageable chunks
Partitions can be added, processed,
deleted independently
Update to last month’s data does not affect prior months’ partitions
Sliding window scenario easy to implement
E.g. 24 month window → add June 2006 partition and delete June 2004
Partitions can have different storage settings
Partitions require Enterprise Edition!
Benefits of Partitioning
Partitions can be processed and queried
in parallel
Better utilization of server resources
Reduced data warehouse load times
Queries are isolated to relevant partitions → less data to scan
SELECT … FROM … WHERE [Time].[Year].[2006]
Queries only 2006 partitions
Bottom line → partitions enable:
Manageability
Performance
Scalability
Best Practices for Partitioning
No more than 20M rows per partition
Specify partition slice (see the sketch after this slide)
Optional for MOLAP – server auto-detects the slice and validates against user
specified slice (if any)
Must be specified for ROLAP
Manage storage settings by usage patterns
Frequently queried → MOLAP with lots of aggs
Periodically queried → MOLAP with fewer or no aggs
Historical → ROLAP with no aggs
Alternate disk drive - use multiple controllers to avoid I/O
contention
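To make the partition-slice tip concrete: each partition is also bound to a source query that selects only its own slice of the fact table. A hedged T-SQL sketch (dbo.FactSales and OrderDateKey are hypothetical names):

-- Source query for a June 2006 partition; the disjoint date range keeps
-- exactly one month of facts in this partition.
SELECT CustomerKey, ProductKey, OrderDateKey, SalesAmount
FROM dbo.FactSales
WHERE OrderDateKey >= 20060601 AND OrderDateKey < 20060701;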
Best Practices for Aggregations
Define all possible attribute relationships
Set accurate attribute member counts and fact
table counts
Set AggregationUsage to guide agg designer
Set rarely queried attributes to None
Set commonly queried attributes to Unrestricted
Do not build too many aggregations
In the 100s, not 1000s!
Do not build aggregations larger than 30% of
fact table size (agg design algorithm doesn’t)
Best Practices for Aggregations
Aggregation design cycle
Use Storage Design Wizard (~20% perf gain) to design initial set of
aggregations
Enable query log and run pilot workload (beta test with limited set of users)
Use Usage Based Optimization (UBO) Wizard to refine aggregations
Use a larger perf gain (70-80%)
Reprocess partitions for new aggregations to
take effect
Periodically use UBO to refine aggregations
Agenda
 Server architecture and UDM Basics
 Optimizing the cube design
 Partitioning and Aggregations
 Processing
 Queries and calculations
 Conclusion
Improving Processing
 SQL Server Performance Tuning
 Improve the queries that are used for extracting data from SQL Server
 Check for proper plans and indexing
 Conduct regular SQL performance tuning process
 AS Processing Improvements
 Use SP2 !!
 Processing 20 partitions: SP1 1:56, SP2: 1:06
 Don’t accept the UI default for parallel processing
 Go into the advanced processing tab and change it
 Monitor the values:
 Maximum number of datasource connections
 MaxParallel – How many partitions processed in parallel, don’t let the server decide on its own.
 Use INT for keys, if possible.
Parallel processing requires Enterprise Edition!
Improving Processing
 For best performance use ASCMD.EXE and
XMLA
 Use <Parallel> </Parallel> to group processing tasks together until the Server is using maximum resources (see the XMLA sketch after this slide)
 Proper use of <Transaction> </Transaction>
 ProcessFact and ProcessIndex separately
instead of ProcessFull (for large partitions)
 Consumes less memory.
 ProcessClearIndexes deletes existing indexes and ProcessIndexes generates or reprocesses them
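A minimal XMLA sketch of the grouping described above; the database, cube, measure group, and partition IDs are placeholders, and ProcessData/ProcessIndexes are the XMLA spellings of the ProcessFact/ProcessIndex split:

<!-- One transactional batch; the two partitions process in parallel. -->
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine"
       Transaction="true">
  <Parallel>
    <Process>
      <Object>
        <DatabaseID>MyDW</DatabaseID>
        <CubeID>Sales</CubeID>
        <MeasureGroupID>FactSales</MeasureGroupID>
        <PartitionID>Sales_2006_06</PartitionID>
      </Object>
      <Type>ProcessData</Type> <!-- fact data only; run ProcessIndexes in a later batch -->
    </Process>
    <Process>
      <Object>
        <DatabaseID>MyDW</DatabaseID>
        <CubeID>Sales</CubeID>
        <MeasureGroupID>FactSales</MeasureGroupID>
        <PartitionID>Sales_2006_07</PartitionID>
      </Object>
      <Type>ProcessData</Type>
    </Process>
  </Parallel>
</Batch>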
Best Practices for Processing
Partition processing
Monitor aggregation processing spilling to disk (perfmon counters for temp
file usage)
Add memory, turn on /3GB, move to x64/ia64
Fully process partitions periodically
Achieves better compression over repeated incremental processing
Data sources
Avoid using .NET data sources – OLEDB is faster for processing
Agenda
 Server architecture
 UDM Basics
 Optimizing the cube design
 Partitioning and Aggregations
 Processing
 Queries and calculations
 Conclusion
Non_Empty_Behavior
Most client tools (Excel, ProClarity) display non empty results –
eliminate members with no data
With no calculations, non empty is fast – just checks fact data
With calculations, non empty can be slow – requires evaluating
formula for each cell
Non_Empty_Behavior allows non empty on calculations to just
check fact data
Note: query processing hint – use with care!
Create Member [Measures].[Internet Gross Profit] As
[Internet Sales Amount] - [Internet Total Cost],
Format_String = "Currency",
Non_Empty_Behavior = [Internet Sales Amount];
Auto-Exists
Attributes/hierarchies within a dimension are always existed
together
City.Seattle * State.Members returns {(Seattle, WA)}
(Seattle, OR), (Seattle, CA) do not “exist”
Exploit the power of auto-exists
Use Exists/CrossJoin instead of .Properties – faster
Requires attribute hierarchy enabled on member property
Slower: Filter(Customer.Members, Customer.CurrentMember.Properties("Gender") = "Male")
Faster: Exists(Customer.Members, Gender.[Male])
Conditional Statement: IIF
Use scopes instead of conditions such as Iif/Case
Scopes are evaluated once statically
Conditions are evaluated dynamically for each cell
Always try to coerce one branch of the IIF to be null
Create Member Measures.Sales As
Iif(Currency.CurrentMember Is Currency.USD,
Measures.SalesUSD, Measures.SalesUSD * Measures.XRate);
Create Member Measures.Sales As Null;
Scope(Measures.Sales, Currency.Members);
This = Measures.SalesUSD * Measures.XRate;
Scope(Currency.USD);
This = Measures.SalesUSD;
End Scope;
End Scope;
Best Practices for MDX
Use calc members instead of calc cells where possible
Use .MemberValue for calcs on numeric attributes
Filter(Customer.members, Salary.MemberValue > 100000)
Avoid redundant use of .CurrentMember and .Value
(Time.CurrentMember.PrevMember, Measures.CurrentMember ).Value can be replaced with
Time.PrevMember
Avoid LinkMember, StrToSet, StrToMember, StrToValue
Replace simple calcs with computed columns in DSV (see the sketch after this slide)
Calculation done at processing time is always better
Many more at:
Analysis Services Performance Whitepaper:
http://download.microsoft.com/download/8/5/e/85eea4fa-b3bb-4426-97d0-7f7151b2011c/SSAS2005PerfGuide.doc
http://sqljunkies.com/weblog/mosha
http://sqlserveranalysisservices.com
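A sketch of the computed-columns-in-DSV tip: a named calculation in the data source view is just a SQL expression evaluated at processing time, so simple calcs can move out of MDX. The table and column names below are hypothetical:

-- Named calculation computed once during processing,
-- instead of an MDX calculated member evaluated per query.
SELECT SalesAmount,
       TotalProductCost,
       SalesAmount - TotalProductCost AS GrossProfit
FROM dbo.FactInternetSales;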
Conclusion
AS2005 is a major re-architecture of AS2000
Design for perf & scalability from the start
Many principles carry through from AS2000
Dimensional design, Partitioning, Aggregations
Many new principles in AS2005
Attribute relationships, natural hierarchies
New design alternatives – role playing, many-to-many, reference dimensions, semi-additive
measures
Flexible processing options
MDX scripts, scopes
Use Analysis Services with SQL Server Enterprise Edition to get max
performance and scale
Resources
 SSIS
 SQL Server Integration Services site – links to blogs, training, partners, etc.:
http://msdn.microsoft.com/SQL/sqlwarehouse/SSIS/default.aspx
 SSIS MSDN Forum: http://forums.microsoft.com/MSDN/ShowForum.aspx?ForumID=80&SiteID=1
 SSIS MVP community site:
http://www.sqlis.com
 SSAS
 BLOGS: http://blogs.msdn.com/sqlcat
 PROJECT REAL-Business Intelligence in Practice
 Analysis Services Performance Guide
 TechNet: Analysis Services for IT Professionals
 Microsoft BI
 SQL Server Business Intelligence public site: http://www.microsoft.com/sql/evaluation/bi/default.asp
 http://www.microsoft.com/bi
Mohan Arumugam
Technologies Specialist
E-mail : moohanan@gmail.com
Phone : +91 99406 53876
Thank You

More Related Content

Similar to SQL Server Integration Services and Analysis Services (20)

PPT
It ready dw_day3_rev00
Siwawong Wuttipongprasert
 
PDF
Best Practices SQL 2005 SSIS
ptolozah
 
PDF
SQL Server Integration Services.pdf
bhuvangates
 
PPTX
Continuous Integration and the Data Warehouse - PASS SQL Saturday Slovenia
Dr. John Tunnicliffe
 
PPTX
Continuous Integration and the Data Warehouse - PASS SQL Saturday Slovenia
Dr. John Tunnicliffe
 
PPTX
Ssis ssas sps_mdx_hong_bingli
Hong-Bing Li
 
PDF
Msbi course content
United Global Soft
 
PPSX
Top new ssis 2012 features
Miguel Cebollero
 
PPT
Ssis optimization –better designs
varunragul
 
PPT
3.1\9 SSIS 2008R2_Training - ControlFlow asks
Pramod Singla
 
DOC
AAO BI Resume
Al Ottley
 
PPT
Daniel Bowlin Portfolio Rev1
DanielWBowlin
 
PDF
SSIS Training.pdf
SpiritsoftsTraining
 
PPT
Integration Services Presentation
Catherine Eibner
 
PPT
Integration Services Presentation V2
Catherine Eibner
 
DOCX
PASHA MSBI
KALEEL PASHA
 
PPT
Bi Ppt Portfolio Elmer Donavan
EJDonavan
 
DOC
ShashankJainMSBI
Shashank Jain
 
PPTX
Microsoft BI Stack Portfolio
Angela Trapp
 
PPTX
SSIS: Flow tasks, containers and precedence constraints
Kiki Noviandi
 
It ready dw_day3_rev00
Siwawong Wuttipongprasert
 
Best Practices SQL 2005 SSIS
ptolozah
 
SQL Server Integration Services.pdf
bhuvangates
 
Continuous Integration and the Data Warehouse - PASS SQL Saturday Slovenia
Dr. John Tunnicliffe
 
Continuous Integration and the Data Warehouse - PASS SQL Saturday Slovenia
Dr. John Tunnicliffe
 
Ssis ssas sps_mdx_hong_bingli
Hong-Bing Li
 
Msbi course content
United Global Soft
 
Top new ssis 2012 features
Miguel Cebollero
 
Ssis optimization –better designs
varunragul
 
3.1\9 SSIS 2008R2_Training - ControlFlow asks
Pramod Singla
 
AAO BI Resume
Al Ottley
 
Daniel Bowlin Portfolio Rev1
DanielWBowlin
 
SSIS Training.pdf
SpiritsoftsTraining
 
Integration Services Presentation
Catherine Eibner
 
Integration Services Presentation V2
Catherine Eibner
 
PASHA MSBI
KALEEL PASHA
 
Bi Ppt Portfolio Elmer Donavan
EJDonavan
 
ShashankJainMSBI
Shashank Jain
 
Microsoft BI Stack Portfolio
Angela Trapp
 
SSIS: Flow tasks, containers and precedence constraints
Kiki Noviandi
 

More from Mohan Arumugam (7)

PPTX
Introduction to Sharepoint 2013 Devlopment
Mohan Arumugam
 
PPTX
Introduction - DotNet 4.0 Intro by Mohan
Mohan Arumugam
 
PPTX
Python - List, Dictionaries, Tuples,Sets
Mohan Arumugam
 
PPTX
Python-Yesterday Today Tomorrow(What's new?)
Mohan Arumugam
 
PPTX
Project Management Challenges
Mohan Arumugam
 
PPTX
Power Shell and Sharepoint 2013
Mohan Arumugam
 
PPTX
SharePoint Object Model, Web Services and Events
Mohan Arumugam
 
Introduction to Sharepoint 2013 Devlopment
Mohan Arumugam
 
Introduction - DotNet 4.0 Intro by Mohan
Mohan Arumugam
 
Python - List, Dictionaries, Tuples,Sets
Mohan Arumugam
 
Python-Yesterday Today Tomorrow(What's new?)
Mohan Arumugam
 
Project Management Challenges
Mohan Arumugam
 
Power Shell and Sharepoint 2013
Mohan Arumugam
 
SharePoint Object Model, Web Services and Events
Mohan Arumugam
 
Ad

Recently uploaded (20)

PDF
Biological Bilingual Glossary Hindi and English Medium
World of Wisdom
 
PDF
QNL June Edition hosted by Pragya the official Quiz Club of the University of...
Pragya - UEM Kolkata Quiz Club
 
PDF
0725.WHITEPAPER-UNIQUEWAYSOFPROTOTYPINGANDUXNOW.pdf
Thomas GIRARD, MA, CDP
 
PDF
ARAL-Orientation_Morning-Session_Day-11.pdf
JoelVilloso1
 
PDF
Horarios de distribución de agua en julio
pegazohn1978
 
PPTX
care of patient with elimination needs.pptx
Rekhanjali Gupta
 
PPTX
GRADE-3-PPT-EVE-2025-ENG-Q1-LESSON-1.pptx
EveOdrapngimapNarido
 
PPTX
Cultivation practice of Litchi in Nepal.pptx
UmeshTimilsina1
 
PPTX
QUARTER 1 WEEK 2 PLOT, POV AND CONFLICTS
KynaParas
 
PDF
Stokey: A Jewish Village by Rachel Kolsky
History of Stoke Newington
 
PDF
DIGESTION OF CARBOHYDRATES,PROTEINS,LIPIDS
raviralanaresh2
 
PDF
Dimensions of Societal Planning in Commonism
StefanMz
 
PPTX
PPT-Q1-WEEK-3-SCIENCE-ERevised Matatag Grade 3.pptx
reijhongidayawan02
 
PPTX
MENINGITIS: NURSING MANAGEMENT, BACTERIAL MENINGITIS, VIRAL MENINGITIS.pptx
PRADEEP ABOTHU
 
PPTX
How to Convert an Opportunity into a Quotation in Odoo 18 CRM
Celine George
 
PDF
Generative AI: it's STILL not a robot (CIJ Summer 2025)
Paul Bradshaw
 
PPTX
Neurodivergent Friendly Schools - Slides from training session
Pooky Knightsmith
 
PPTX
How to Create Odoo JS Dialog_Popup in Odoo 18
Celine George
 
PDF
The History of Phone Numbers in Stoke Newington by Billy Thomas
History of Stoke Newington
 
PDF
Knee Extensor Mechanism Injuries - Orthopedic Radiologic Imaging
Sean M. Fox
 
Biological Bilingual Glossary Hindi and English Medium
World of Wisdom
 
QNL June Edition hosted by Pragya the official Quiz Club of the University of...
Pragya - UEM Kolkata Quiz Club
 
0725.WHITEPAPER-UNIQUEWAYSOFPROTOTYPINGANDUXNOW.pdf
Thomas GIRARD, MA, CDP
 
ARAL-Orientation_Morning-Session_Day-11.pdf
JoelVilloso1
 
Horarios de distribución de agua en julio
pegazohn1978
 
care of patient with elimination needs.pptx
Rekhanjali Gupta
 
GRADE-3-PPT-EVE-2025-ENG-Q1-LESSON-1.pptx
EveOdrapngimapNarido
 
Cultivation practice of Litchi in Nepal.pptx
UmeshTimilsina1
 
QUARTER 1 WEEK 2 PLOT, POV AND CONFLICTS
KynaParas
 
Stokey: A Jewish Village by Rachel Kolsky
History of Stoke Newington
 
DIGESTION OF CARBOHYDRATES,PROTEINS,LIPIDS
raviralanaresh2
 
Dimensions of Societal Planning in Commonism
StefanMz
 
PPT-Q1-WEEK-3-SCIENCE-ERevised Matatag Grade 3.pptx
reijhongidayawan02
 
MENINGITIS: NURSING MANAGEMENT, BACTERIAL MENINGITIS, VIRAL MENINGITIS.pptx
PRADEEP ABOTHU
 
How to Convert an Opportunity into a Quotation in Odoo 18 CRM
Celine George
 
Generative AI: it's STILL not a robot (CIJ Summer 2025)
Paul Bradshaw
 
Neurodivergent Friendly Schools - Slides from training session
Pooky Knightsmith
 
How to Create Odoo JS Dialog_Popup in Odoo 18
Celine George
 
The History of Phone Numbers in Stoke Newington by Billy Thomas
History of Stoke Newington
 
Knee Extensor Mechanism Injuries - Orthopedic Radiologic Imaging
Sean M. Fox
 
Ad

SQL Server Integration Services and Analysis Services

  • 1. SQL Server Integration Services and Analysis Services Mohan
  • 2. Consultant Certification Mohan Arumugam Technologies Specialist E-mail : [email protected] Phone : +91 99406 53876 Profile Blogger Trainer Who Am I ?
  • 3. Agenda • Overview of Integration Services • Principles of Good Package Design • Component Drilldown • Performance Tuning
  • 4. A DW architecture Datawarehouse (SQL Server, Oracle, DB2, Teradata) SQL/Oracle SAP/Dynamics Legacy Text XML Integration Services Reports Dashboards Scorecards Excel BI tools Analysis Services
  • 5. Session Objectives  Assumptions  Experience with SSIS and SSAS  Goals  Discuss design, performance, and scalability for building ETL packages and cubes (UDMs)  Best practices  Common mistakes
  • 6. SQL Server 2005 BPA availability!  BPA = Best Practice Analyzer  Utility that scans your SQL Server metadata and recommends best practices  Best practices from dev team and Customer Support Services  What’s new:  Support for SQL Server 2005  Support for Analysis Services and Integration Services  Scan scheduling  Auto update framework  CTP available now, RTM April  http://www.microsoft.com/downloads/details.aspx?FamilyId=DA0531E4-E94C-4991-82FA-F0E3FB D05E63&displaylang=en
  • 7. Agenda  Integration Services  Quick overview of IS  Principles of Good Package Design  Component Drilldown  Performance Tuning  Analysis Services  UDM overview  UDM design best practices  Performance tips
  • 8. What is SQL Server Integration Services?  Introduced in SQL Server 2005  The successor to Data Transformation Services  The platform for a new generation of high-performance data integration technologies
  • 9. Call center data: semi structured Legacy data: binary files Application database ETL Warehouse Reports Mobile data Data mining Alerts & escalation •Integration and warehousing require separate, staged operations. •Preparation of data requires different, often incompatible, tools. •Reporting and escalation is a slow process, delaying smart responses. •Heavy data volumes make this scenario increasingly unworkable. Hand coding Staging Text mining ETL Staging Cleansing & ETL Staging ETL ETL Objective: Before SSIS
  • 10. Call center: semi-structured data Legacy data: binary files Application database •Integration and warehousing are a seamless, manageable operation. •Source, prepare, and load data in a single, auditable process. •Reporting and escalation can be parallelized with the warehouse load. •Scales to handle heavy and complex data requirements. SQL Server Integration Services Text mining components Custom source Standard sources Data-cleansing components Merges Data mining components Warehouse Reports Mobile data Alerts & escalation Changing the Game with SSIS
  • 11. SSIS Architecture  Control Flow (Runtime)  A parallel workflow engine  Executes containers and tasks  Data Flow (“Pipeline”)  A special runtime task  A high-performance data pipeline  Applies graphs of components to data movement  Component can be sources, transformations or destinations  Highly parallel operations possible
  • 12. Agenda  Overview of Integration Services  Principles of Good Package Design  Component Drilldown  Performance Tuning
  • 13. Principles of Good Package Design - General  Follow Microsoft Development Guidelines  Iterative design, development & testing  Understand the Business  Understanding the people & processes are critical for success  Kimball’s “Data Warehouse ETL Toolkit” book is an excellent reference  Get the big picture  Resource contention, processing windows, …  SSIS does not forgive bad database design  Old principles still apply – e.g. load with/without indexes?  Platform considerations  Will this run on IA64 / X64?  No BIDS on IA64 – how will I debug?  Is OLE-DB driver XXX available on IA64?  Memory and resource usage on different platforms
  • 14. Principles of Good Package Design - Architecture  Process Modularity  Break complex ETL into logically distinct packages (vs. monolithic design)  Improves development & debug experience  Package Modularity  Separate sub-processes within package into separate Containers  More elegant, easier to develop  Simple to disable whole Containers when debugging  Component Modularity  Use Script Task/Transform for one-off problems  Build custom components for maximum re-use
  • 17. Principles of Good Package Design - Infrastructure  Use Package Configurations  Build it in from the start  Will make things easier later on  Simplify deployment Dev  QA  Production  Use Package Logging  Performance & debugging  Build in Security from the start  Credentials and other sensitive info  Package & Process IP  Configurations & Parameters
  • 18. Principles of Good Package Design - Development  SSIS is visual programming!  Use source code control system  Undo is not as simple in a GUI environment!  Improved experience for multi-developer environment  Comment your packages and scripts  In 2 weeks even you may forget a subtlety of your design  Someone else has to maintain your code  Use error-handling  Use the correct precedence constraints on tasks  Use the error outputs on transforms – store them in a table for processing later, or use downstream if the error can be handled in the package  Try…Catch in your scripts
  • 19. Component Drilldown - Tasks & Transforms  Avoid over-design  Too many moving parts is inelegant and likely slow  But don’t be afraid to experiment – there are many ways to solve a problem  Maximize Parallelism  Allocate enough threads  EngineThreads property on DataFlow Task  “Rule of thumb” - # of datasources + # of async components  Minimize blocking  Synchronous vs. Asynchronous components  Memcopy is expensive – reduce the number of asynchronous components in a flow if possible – example coming up later  Minimize ancillary data  For example, minimize data retrieved by LookupTx
  • 20. Debugging & Performance Tuning - General  Leverage the logging and auditing features  MsgBox is your friend  Pipeline debuggers are your friend  Use the throughput component from Project REAL  Experiment with different techniques  Use source code control system  Focus on the bottlenecks – methodology discussed later  Test on different platforms  32bit, IA64, x64  Local Storage, SAN  Memory considerations  Network & topology considerations
  • 21. Debugging & Performance Tuning - Volume  Remove redundant columns  Use SELECT statements as opposed to tables  SELECT * is your enemy  Also remove redundant columns after every async component!  Filter rows  WHERE clause is your friend  Conditional Split in SSIS  Concatenate or re-route unneeded columns  Parallel loading  Source system split source data into multiple chunks  Flat Files – multiple files  Relational – via key fields and indexes  Multiple Destination components all loading same table
  • 22. Debugging & Performance Tuning - Application  Is BCP good enough?  Overhead of starting up an SSIS package may offset any performance gain over BCP for small data sets  Is the greater manageability and control of SSIS needed?  Which pattern?  Many Lookup patterns possible – which one is most suitable?  See Project Real for examples of patterns: http://www.microsoft.com/sql/solutions/bi/projectreal.mspx  Which component?  Bulk Import Task vs. Data Flow  Bulk Import might give better performance if there are no transformations or filtering required, and the destination is SQL Server.  Lookup vs. MergeJoin (LeftJoin) vs. set based statements in SQL  MergeJoin might be required if you’re not able to populate the lookup cache.  Set based SQL statements might provide a way to persist lookup cache misses and apply a set based operation for higher performance.  Script vs. custom component  Script might be good enough for small transforms that’re typically not reused
  • 23. Case Study - Patterns 105 seconds 83 seconds Use Error Output for handling Lookup miss Ignore lookup errors and check for null looked up values in Derived Column
  • 24. Debugging & Performance Tuning – A methodology  Optimize and Stabilize the basics  Minimize staging (else use RawFiles if possible)  Make sure you have enough Memory  Windows, Disk, Network, …  SQL FileGroups, Indexing, Partitioning  Get Baseline  Replace destinations with RowCount  Source->RowCount throughput  Source->Destination throughput  Incrementally add/change components to see effect  This could include the DB layer  Use source code control!  Optimize slow components for resources available
  • 25. Case Study - Parallelism  Focus on critical path  Utilize available resources Memory Constrained Reader and CPU Constrained Let it rip! Optimize the slowest
  • 26. Summary  Follow best practice development methods  Understand how SSIS architecture influences performance  Buffers, component types  Design Patterns  Learn the new features  But do not forget the existing principles  Use the native functionality  But do not be afraid to extend  Measure performance  Focus on the bottlenecks  Maximize parallelism and memory use where appropriate  Be aware of different platforms capabilities (64bit RAM)  Testing is key
  • 28. Agenda Server architecture and UDM basics Optimizing the cube design Partitioning and Aggregations Processing Queries and calculations Conclusion
  • 30. Dimension An entity on which analysis is to be performed (e.g. Customers) Consists of: Attributes that describe the entity Hierarchies that organize dimension members in meaningful ways Customer ID First Name Last Name State City Marital Status Gender … Age 123 John Doe WA Seattle Married Male … 42 456 Lance Smith WA Redmond Unmarried Male … 34 789 Jill Thompson OR Portland Married Female … 21
  • 31. Attribute  Containers of dimension members.  Completely define the dimensional space.  Enable slicing and grouping the dimensional space in interesting ways.  Customers in state WA and age > 50  Customers who are married and male  Typically have one-many relationships  City  State, State  Country, etc.  All attributes implicitly related to the key
  • 32. Hierarchy Ordered collection of attributes into levels Navigation path through dimensional space User defined hierarchies – typically multiple levels Attribute hierarchies – implicitly created for each attribute – single level Customers by Geography Country State City Customer Customers by Demographics Marital Gender Customer
  • 34. Cube Collection of dimensions and measures Measure  numeric data associated with a set of dimensions (e.g. Qty Sold, Sales Amount, Cost) Multi-dimensional space Defined by dimensions and measures E.g. (Customers, Products, Time, Measures) Intersection of dimension members and measures is a cell (USA, Bikes, 2004, Sales Amount) = $1,523,374.83
  • 35. A Cube Product Peas Corn Bread Milk Beer M a r k e t Bos NYC Chi Sea Jan Mar Feb Time Units of beer sold in Boston in January
  • 36. Measure Group Group of measures with same dimensionality Analogous to fact table Cube can contain more than one measure group E.g. Sales, Inventory, Finance Multi-dimensional space Subset of dimensions and measures in the cube AS2000 comparison Virtual Cube  Cube Cube  Measure Group
  • 37. Sales Inventory Finance Customers X Products X X Time X X X Promotions X Warehouse X Department X Account X Scenario X Measure Group Measure Group
  • 38. Agenda  Server architecture and UDM Basics  Optimizing the cube design  Partitioning and Aggregations  Processing  Queries and calculations  Conclusion
  • 39. Top 3 Tenets of Good Cube Design Attribute relationships Attribute relationships Attribute relationships
  • 40. Attribute Relationships One-to-many relationships between attributes Server simply “works better” if you define them where applicable Examples: City  State, State  Country Day  Month, Month  Quarter, Quarter  Year Product Subcategory  Product Category Rigid v/s flexible relationships (default is flexible) Customer  City, Customer  PhoneNo are flexible Customer  BirthDate, City  State are rigid All attributes implicitly related to key attribute
  • 41. Customer City State Country Gender Marital Age Attribute Relationships (continued)
  • 43. Attribute Relationships Where are they used? MDX Semantics Tells the formula engine how to roll up measure values If the grain of the measure group is different from the key attribute (e.g. Sales by Month) Attribute relationships from grain to other attributes required (e.g. Month  Quarter, Quarter  Year) Otherwise no data (NULL) returned for Quarter and Year MDX Semantics explained in detail at: http://www.sqlserveranalysisservices.com/OLAPPapers/AttributeRelationships.htm
  • 44. Attribute Relationships Where are they used? Storage Reduces redundant relationships between dimension members – normalizes dimension storage Enables clustering of records within partition segments (e.g. store facts for a month together) Processing Reduces memory consumption in dimension processing – less hash tables to fit in memory Allows large dimensions to push 32-bit barrier Speeds up dimension and partition processing overall
  • 45. Attribute Relationships Where are they used? Query performance Dimension storage access is faster Produces more optimal execution plans Aggregation design Enables aggregation design algorithm to produce effective set of aggregations Dimension security DeniedSet = {State.WA} should deny cities and customers in WA – requires attribute relationships Member properties Attribute relationships identify member properties on levels
  • 46. Attribute Relationships How to set them up? Creating an attribute relationship is easy, but … Pay careful attention to the key columns! Make sure every attribute has unique key columns (add composite keys as needed) There must be a 1:M relation between the key columns of the two attributes Invalid key columns cause a member to have multiple parents Dimension processing picks one parent arbitrarily and succeeds Hierarchy looks wrong!
  • 47. Attribute Relationships How to set them up? Don’t forget to remove redundant relationships! All attributes start with relationship to key Customer  City  State  Country Customer  State (redundant) Customer  Country (redundant)
  • 48. Attribute Relationships Example Time dimension Day, Week, Month, Quarter, Year Year: 2003 to 2010 Quarter: 1 to 4 Month: 1 to 12 Week: 1 to 52 Day: 20030101 to 20101231 Day Week Month Quarter Year
  • 49. Attribute Relationships Example Time dimension Day, Week, Month, Quarter, Year Year: 2003 to 2010 Quarter: 1 to 4 - Key columns (Year, Quarter) Month: 1 to 12 Week: 1 to 52 Day: 20030101 to 20101231 Day Week Month Quarter Year
  • 50. Time dimension Day, Week, Month, Quarter, Year Year: 2003 to 2010 Quarter: 1 to 4 - Key columns (Year, Quarter) Month: 1 to 12 - Key columns (Year, Month) Week: 1 to 52 Day: 20030101 to 20101231 Attribute Relationships Example Day Week Month Quarter Year
  • 51. Time dimension Day, Week, Month, Quarter, Year Year: 2003 to 2010 Quarter: 1 to 4 - Key columns (Year, Quarter) Month: 1 to 12 - Key columns (Year, Month) Week: 1 to 52 - Key columns (Year, Week) Day: 20030101 to 20101231 Attribute Relationships Example Day Week Month Quarter Year
  • 53. User Defined Hierarchies Pre-defined navigation paths thru dimensional space defined by attributes Attribute hierarchies enable ad hoc navigation Why create user defined hierarchies then? Guide end users to interesting navigation paths Existing client tools are not “attribute aware” Performance Optimize navigation path at processing time Materialization of hierarchy tree on disk Aggregation designer favors user defined hierarchies
  • 54. Natural Hierarchies 1:M relation (via attribute relationships) between every pair of adjacent levels Examples: Country-State-City-Customer (natural) Country-City (natural) State-Customer (natural) Age-Gender-Customer (unnatural) Year-Quarter-Month (depends on key columns) How many quarters and months? 4 & 12 across all years (unnatural) 4 & 12 for each year (natural)
  • 55. Natural Hierarchies Best Practice for Hierarchy Design Performance implications Only natural hierarchies are materialized on disk during processing Unnatural hierarchies are built on the fly during queries (and cached in memory) Server internally decomposes unnatural hierarchies into natural components Essentially operates like ad hoc navigation path (but somewhat better) Create natural hierarchies where possible Using attribute relationships Not always appropriate (e.g. Age-Gender)
  • 56. Best Practices for Cube Design Dimensions Consolidate multiple hierarchies into single dimension (unless they are related via fact table) Avoid ROLAP storage mode if performance is key Use role playing dimensions (e.g. OrderDate, BillDate, ShipDate) - avoids multiple physical copies Use parent-child dimensions prudently No intermediate level aggregation support Use many-to-many dimensions prudently Slower than regular dimensions, but faster than calculations Intermediate measure group must be “small” relative to primary measure group
  • 57. Best Practices for Cube Design Attributes Define all possible attribute relationships! Mark attribute relationships as rigid where appropriate Use integer (or numeric) key columns Set AttributeHierarchyEnabled to false for attributes not used for navigation (e.g. Phone#, Address) Set AttributeHierarchyOptimizedState to NotOptimized for infrequently used attributes Set AttributeHierarchyOrdered to false if the order of members returned by queries is not important Hierarchies Use natural hierarchies where possible
  • 58. Best Practices for Cube Design Measures Use smallest numeric data type possible Use semi-additive aggregate functions instead of MDX calculations to achieve same behavior Put distinct count measures into separate measure group (BIDS does this automatically) Avoid string source column for distinct count measures
  • 59. Agenda  Server architecture and UDM Basics  Optimizing the cube design  Partitioning and Aggregations  Processing  Queries and calculations  Conclusion
  • 60. Partitioning Mechanism to break up large cube into manageable chunks Partitions can be added, processed, deleted independently Update to last month’s data does not affect prior months’ partitions Sliding window scenario easy to implement E.g. 24 month window  add June 2006 partition and delete June 2004 Partitions can have different storage settings Partitions require Enterprise Edition!
  • 61. Benefits of Partitioning Partitions can be processed and queried in parallel Better utilization of server resources Reduced data warehouse load times Queries are isolated to relevant partitions  less data to scan SELECT … FROM … WHERE [Time].[Year].[2006] Queries only 2006 partitions Bottom line  partitions enable: Manageability Performance Scalability
  • 62. Best Practices for Partitioning No more than 20M rows per partition Specify partition slice Optional for MOLAP – server auto-detects the slice and validates against user specified slice (if any) Must be specified for ROLAP Manage storage settings by usage patterns Frequently queried  MOLAP with lots of aggs Periodically queried  MOLAP with less or no aggs Historical  ROLAP with no aggs Alternate disk drive - use multiple controllers to avoid I/O contention
  • 63. Best Practices for Aggregations Define all possible attribute relationships Set accurate attribute member counts and fact table counts Set AggregationUsage to guide agg designer Set rarely queried attributes to None Set commonly queried attributes to Unrestricted Do not build too many aggregations In the 100s, not 1000s! Do not build aggregations larger than 30% of fact table size (agg design algorithm doesn’t)
  • 64. Best Practices for Aggregations Aggregation design cycle Use Storage Design Wizard (~20% perf gain) to design initial set of aggregations Enable query log and run pilot workload (beta test with limited set of users) Use Usage Based Optimization (UBO) Wizard to refine aggregations Use larger perf gain (70-80%) Reprocess partitions for new aggregations to take effect Periodically use UBO to refine aggregations
  • 65. Agenda  Server architecture and UDM Basics  Optimizing the cube design  Partitioning and Aggregations  Processing  Queries and calculations  Conclusion
  • 66. Improving Processing  SQL Server Performance Tuning  Improve the queries that are used for extracting data from SQL Server  Check for proper plans and indexing  Conduct regular SQL performance tuning process  AS Processing Improvements  Use SP2 !!  Processing 20 partitions: SP1 1:56, SP2: 1:06  Don’t let UI default for parallel processing  Go into advanced processing tab and change it  Monitor the values:  Maximum number of datasource connections  MaxParallel – How many partitions processed in parallel, don’t let the server decide on its own.  Use INT for keys, if possible. Parallel processing requires Enterprise Edition!
  • 67. Improving Processing  For best performance use ASCMD.EXE and XMLA  Use <Parallel> </Parallel> to group processing tasks together until Server is using maximum resources  Proper use of <Transaction> </Transaction>  ProcessFact and ProcessIndex separately instead of ProcessFull (for large partitions)  Consumes less memory.  ProcessClearIndexes deletes existing indexes and ProcessIndexes generates or reprocesses
  • 68. Best Practices for Processing Partition processing Monitor aggregation processing spilling to disk (perfmon counters for temp file usage) Add memory, turn on /3GB, move to x64/ia64 Fully process partitions periodically Achieves better compression over repeated incremental processing Data sources Avoid using .NET data sources – OLEDB is faster for processing
  • 69. Agenda  Server architecture  UDM Basics  Optimizing the cube design  Partitioning and Aggregations  Processing  Queries and calculations  Conclusion
  • 70. Non_Empty_Behavior Most client tools (Excel, Proclarity) display non empty results – eliminate members with no data With no calculations, non empty is fast – just checks fact data With calculations, non empty can be slow – requires evaluating formula for each cell Non_Empty_Behavior allows non empty on calculations to just check fact data Note: query processing hint – use with care! Create Member [Measures].[Internet Gross Profit] As [Internet Sales Amount] - [Internet Total Cost], Format_String = "Currency", Non_Empty_Behavior = [Internet Sales Amount];
  • 71. Auto-Exists Attributes/hierarchies within a dimension are always existed together City.Seattle * State.Members returns {(Seattle, WA)} (Seattle, OR), (Seattle, CA) do not “exist” Exploit the power of auto-exists Use Exists/CrossJoin instead of .Properties – faster Requires attribute hierarchy enabled on member property Filter(Customer.Members, Customer.CurrentMember.Properties(“Gender”) = “Male”) Exists(Customer.Members, Gender.[Male])
  • 72. Conditional Statement: IIF Use scopes instead of conditions such as Iif/Case Scopes are evaluated once statically Conditions are evaluated dynamically for each cell Always try to coerce IIF for one branch to be null Create Member Measures.Sales As Iif(Currency.CurrentMember Is Currency.USD, Measures.SalesUSD, Measures.SalesUSD * Measures.XRate); Create Member Measures.Sales As Null; Scope(Measures.Sales, Currency.Members); This = Measures.SalesUSD * Measures.XRate; Scope(Currency.USA); This = Measures.SalesUSD; End Scope; End Scope;
  • 73. Best Practices for MDX Use calc members instead of calc cells where possible Use .MemberValue for calcs on numeric attributes Filter(Customer.members, Salary.MemberValue > 100000) Avoid redundant use of .CurrentMember and .Value (Time.CurrentMember.PrevMember, Measures.CurrentMember ).Value can be replaced with Time.PrevMember Avoid LinkMember, StrToSet, StrToMember, StrToValue Replace simple calcs with computed columns in DSV Calculation done at processing time is always better Many more at: Analysis Services Performance Whitepaper: http://download.microsoft.com/download/8/5/e/85eea4fa-b3bb-4426-97d0-7f7151b2011c/SSAS2005P erfGuide.doc http://sqljunkies.com/weblog/mosha http://sqlserveranalysisservices.com
  • 74. Conclusion
AS2005 is a major re-architecture of AS2000
Design for performance & scalability from the start
Many principles carry through from AS2000: dimensional design, partitioning, aggregations
Many new principles in AS2005: attribute relationships, natural hierarchies
New design alternatives – role playing, many-to-many, reference dimensions, semi-additive measures
Flexible processing options: MDX scripts, scopes
Use Analysis Services with SQL Server Enterprise Edition to get maximum performance and scale
  • 75. Resources
 SSIS
   SQL Server Integration Services site – links to blogs, training, partners, etc.: http://msdn.microsoft.com/SQL/sqlwarehouse/SSIS/default.aspx
   SSIS MSDN Forum: http://forums.microsoft.com/MSDN/ShowForum.aspx?ForumID=80&SiteID=1
   SSIS MVP community site: http://www.sqlis.com
 SSAS
   Blogs: http://blogs.msdn.com/sqlcat
   Project REAL – Business Intelligence in Practice
   Analysis Services Performance Guide
   TechNet: Analysis Services for IT Professionals
 Microsoft BI
   SQL Server Business Intelligence public site: http://www.microsoft.com/sql/evaluation/bi/default.asp
   http://www.microsoft.com/bi
  • 76. Mohan Arumugam
Technologies Specialist
E-mail : moohanan@gmail.com
Phone : +91 99406 53876
Profile
Thank You

Editor's Notes

  • #4: TPs: The data warehouse, as a central repository of the data, allows business monitoring, analysis, planning, etc. It is a bridge between numerous heterogeneous data sources in the organization and the tools (reports, Excel, scorecards, and dashboards) that end users need to get business insight. Having the right data, cleaned and conformed, is the key to successful BI implementations and ROI for the business. However (animate Integration Services), it takes effort to bring the data in. You need a powerful tool to transform, clean, and load the data. This is where Integration Services comes into the picture. It is our tool, included with SQL Server, that allows you to get your data right before it is stored in the DW. Once you have the data in the data warehouse, the work is not done. While you probably have all of the data there, it is not readily consumable by end users. You need to make it business friendly – define common business entities (dimensions and measures), common business calculations (variance to budget, profit, etc.), and metrics (KPIs), and present the data to users in a format that is easily understandable (end-user-friendly names, hierarchies, etc.). (Animate Analysis Services.) This is where AS comes in handy. It makes the data in the DW much easier to consume by a variety of BI tools and, eventually, your end users. SQL Server has a very powerful offering for building data warehouses: not only do you get the SQL relational database system, you also get AS and IS in the same box to implement an end-to-end DW environment.
  • #13: Also mention "Building Data Warehouses with Microsoft SQL Server". BIDS only runs on x86 because VS.NET only supports that platform. On x64 we're still OK, since BIDS runs in the WoW. On Itanium, however, the message is: develop on x86, deploy on IA64. In other words, dev would use an x86 box, but staging and production would be Itanium. The next issue is that specific OLE DB drivers may be available on x86 but not on any of the 64-bit platforms. In these cases we suggest either using a third-party driver provider (such as ETI) or splitting your packages up so that the extraction is done on a 32-bit box that has the driver, and the rest of the solution is processed on 64-bit. We do have many partners with excellent drivers on 64-bit platforms, though.
  • #14: Process modularity – multiple people can work on it.

Containers are objects in SQL Server 2005 Integration Services (SSIS) that provide structure to packages and services to tasks. They support repeating control flows in packages, and they group tasks and containers into meaningful units of work. Containers can include other containers in addition to tasks. Packages use containers for the following purposes:

Repeat tasks for each element in a collection, such as files in a folder, schemas, or SQL Management Objects (SMO) objects. For example, a package can run Transact-SQL statements that reside in multiple files.
Repeat tasks until a specified expression evaluates to false. For example, a package can send a different e-mail message seven times, one time for every day of the week.
Group tasks and containers that must succeed or fail as a unit. For example, a package can group tasks that delete and add rows in a database table, and then commit or roll back all the tasks when one fails.

The Sequence container defines a control flow that is a subset of the package control flow. Sequence containers group the package into multiple separate control flows, each containing one or more tasks and containers that run within the overall package control flow. There are many benefits of using a Sequence container:

Disabling groups of tasks to focus package debugging on one subset of the package control flow.
Managing properties on multiple tasks in one location by setting properties on a Sequence container instead of on the individual tasks.
Providing scope for variables that a group of related tasks and containers use.
  • #17: Demo? – show how to set up performance logs for SSIS.

Using package configurations can provide the following benefits:
Configurations make it easier to move packages from a development environment to a production environment. For example, a configuration can update the connection string in the connection manager that the package uses.
Configurations are useful when you deploy packages to many different servers. For example, a variable in the configuration for each deployed package can contain a different disk space threshold, under which the package does not run.
Configurations make packages more flexible. For example, a configuration can update the value of a variable that is used in a property expression.

Logging – different providers: SQL Profiler, SQL table, text, XML, Windows event log.

Demo: Monitoring Performance of the Data Flow Engine. Control Panel -> Admin Tools -> Performance -> Perf counters -> New Counter Log.
Buffer Memory – the amount of buffer memory currently in use. If bigger than the amount of physical memory, swapping occurs -> performance goes down.
Rows read – the number of rows that a source produces (doesn't include rows from reference tables used by Lookup).
Rows written – the number of rows offered to a destination. Does not reflect rows written to the destination data store.

Security: package protection level (don't save sensitive, encrypt all with password, encrypt all with user key, encrypt sensitive with password/user key, rely on SQL Server storage – db roles). Sensitive = the password part of the connection string, plus variables that are marked as sensitive.
  • #18: Precedence constraints – you can put a function on a constraint (if the previous task succeeds and some function returns true, then execute the next task).
  • #19: In DTS it was hard to do parallel processing.
Synchronous – a row goes in, gets worked on, goes out.
Asynchronous – e.g., Aggregate and Sort block until all rows are read in. As soon as you have an asynchronous component, you are doing a memory copy; it is expensive and will slow down your package.
If you use the Lookup transform, avoid doing SELECT * – just get the data that you need.
  • #20: Throughput component – Project REAL is downloadable from the Microsoft website; it is a sample end-to-end BI solution built on actual B&N data. We built a throughput script for it which measures the min/max/avg rows per second going through any specific part of a package. There are also several Project REAL whitepapers.
  • #21: Do in SQL: type conversions, null coercing, coalescing, data type sharpening.

    select nullif(name, '') from contacts order by 1
    select convert(tinyint, code) from sales
  • #22: This slide provides more in-depth info for slide 17.
  • #24: SQL features – utilize these other technologies as much as possible. For instance, if the SSIS insertion speed is slow, maybe it's actually that the DB should be utilizing filegroups better, or maybe you need partitioned tables, etc. It comes back to the earlier statement about the right tool for the right job – every piece of the solution needs to be well oiled in order for it to run smoothly, and SSIS can't fix bad DB design problems.
  • #25: Memory Constrained – optimized for low memory. The Aggregate transform internally tries to use minimum memory when calculating aggregates; if one aggregation can be created from another, only the base aggregation is stored.
Reader and CPU constrained – optimized for a slow reader and reduced CPU: all the aggregates will execute on the same thread.
Let it rip – completely parallelized operation: all sources, destinations, and aggregations occur in parallel, with no constraints on memory. However, some of the aggregations might be significantly more expensive to calculate than others, and the final option (Optimize the slowest) balances the use of "Reader and CPU constrained" for the left side and introduces a UnionAll to give a thread each to the aggregate on the right.
  • #29: Client components are simple XMLA wrappers. There is no client-side logic for query evaluation or processing, and client-side caching is limited to metadata – no caching of data, formulas, or dimension members.
Pros:
Less demanding on the client for memory and CPU
Better utilization of server resources
Query performance far less affected by network latency
Client components scale to middle-tier workload
Cons:
All queries require a server roundtrip – could be slower than AS2000, where queries were answered by the client cache
More load on the server – can affect throughput in multi-user scenarios
  • #63: Examines the AggregationUsage property to build the list of candidate attributes:
Full – every agg must include the attribute
None – no agg can include the attribute
Unrestricted – no restrictions on the algorithm
Default – Unrestricted if the attribute is All, a key, or belongs to a natural hierarchy; None otherwise
  • #66: Example:

    <Batch>
      ....
      <Parallel maxParallel="Integer">
        <!-- One or more XMLA commands -->
      </Parallel>
      ....
    </Batch>

AS SP2 processing improvements – implemented dimension property caching, which is why processing is faster. The processing improvement is usually seen with >2 processors. Recommendation: use integer keys.

maxParallel is an optional Integer attribute. It indicates the maximum number of threads on which to run commands in parallel. If it is not specified or is set to 0, the instance of Microsoft SQL Server Analysis Services determines an optimal number of threads based on the number of processors available on the computer. Recommendation: between 0.5 and 1.0 times the number of CPUs, or else thrashing can occur.
  • #67: Many threads are throttled by the max number of datasource connections; check the limit on the datasource. I suggest using the MaxParallel parameter to control concurrency when processing in parallel, rather than letting the system decide.

Do ProcessFact and ProcessIndex separately. They behave differently, so we learn more when they are done separately. Some customers do two steps because they are done using the relational database faster. Also, ProcessFacts uses about 1.5 CPUs (roughly), and ProcessIndex will use more CPUs to build the aggregations – it depends on how many aggregations you have. So we expect ProcessFacts to scale well, and ProcessIndex to use all the CPUs.

NUMA units are a percentage of min(physical, virtual), or bytes if > 100. Allocations are distributed across all NUMA nodes, and will try to use large OS pages. Note that getting large pages from the OS can take a long time – up to a few minutes on huge machines.

I would start with a value corresponding to the amount of memory used in steady state, or 50% of the memory used during processing, then go up to something like 80% of the memory used during processing. You want to pre-allocate enough to avoid the problem, but not steal from other processes (the relational database).

You can get feedback on the amount actually allocated by using a debug monitor (dbmon.exe, or get one from http://www.sysinternals.com).

ProcessClearIndexes – not in the docs; clears indexes. When you need it: if an index is invalid, ProcessIndexes won't do anything, because it sees the indexes as already created.