Data Mining in SQL Server 2008
Ing. Eduardo Castro
GrupoAsesor en Informática
ecastro@grupoasesor.net

Eduardo Castro
ecastro@grupoasesor.net
  • MCITP Server Administrator
  • MCTS Windows Server 2008 Active Directory
  • MCTS Windows Server 2008 Network Infrastructure
  • MCTS Windows Server 2008 Applications Infrastructure
  • MCITP Enterprise Support
  • MCTS Windows Vista
  • MCITP Database Developer
  • MCITP Database Administrator
  • MCTS SQL Server
  • MCITP Exchange Server 2007
  • MCTS Office PerformancePoint Server
  • MCTS Team Foundation Server
  • MCPD Enterprise Application Developer
  • MCTS .NET Framework 2.0: Distributed Applications
  • MCT 2008
  • International Association of Software Architects Chapter Leader
  • IEEE Communications Society Board of Directors
  • European Datawarehouse Research
Disclaimer

The information contained in this slide deck represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This slide deck is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this slide deck may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this slide deck. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this slide deck does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

Unless otherwise noted, the example companies, organizations, products, domain names, e-mail addresses, logos, people, places and events depicted herein are fictitious, and no association with any real company, organization, product, domain name, email address, logo, person, place or event is intended or should be inferred.

© 2008 Microsoft Corporation. All rights reserved.

Microsoft, SQL Server, Office System, Visual Studio, SharePoint Server, Office PerformancePoint Server, .NET Framework, ProClarity Desktop Professional are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

The names of actual companies and products mentioned herein may be the trademarks of their respective owners.

Overview
  • Introducing Data Mining Office Add-Ins
  • Understanding Data Mining Structure Improvements
  • Using the New Time Series Algorithm

Introducing Data Mining Office Add-Ins
  • Data Preparation Tasks
  • Tools for Exploration
  • Tools for Prediction
  • Model Testing and Validation

Data Preparation Tasks

Tools for Exploration - Table Analysis Tools

Tools for Exploration - Data Modeling Tools

Tools for Exploration - Model Viewers
  • Cluster Diagram
      • Distribution of population
      • Strength of similarities between clusters
  • Other viewers: Decision tree, Neural network, Association rules, Time series
  • Cluster Profiles
      • Distribution of values for each attribute
      • Drill through to details
  • Cluster Characteristics
      • Attributes ordered by importance to cluster
      • Probability of an attribute appearing in the cluster
  • Cluster Discrimination
      • Comparison of attributes between two clusters

Tools for Prediction - Table Analysis Tools

Tools for Prediction - Data Modeling Tools

Model Testing and Validation
  • Accuracy Chart
      • Measurement of model accuracy
      • Lift chart comparing actual results to random guess and to perfect prediction
  • Classification Matrix
      • Shows correct and incorrect predictions
      • Displays percentage and counts
  • Profit Chart
      • Estimation of profit by percentage of population contacted
      • Input: population, fixed cost, individual cost, revenue per individual
      • Output: maximum profit, probability threshold
  • Cross-Validation (covered later)

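The profit chart inputs and outputs listed on the Model Testing and Validation slide can be illustrated with a small calculation. The following is a minimal sketch in Python, not the add-in's implementation. It assumes that cases are contacted in descending order of predicted probability and that profit is the revenue from responders minus the fixed cost and the per-contact cost. The function names `profit_curve` and `best_cutoff` and the sample numbers are hypothetical, and scaling the sample up to the full population is omitted.

```python
from typing import List, Tuple


def profit_curve(
    scored_cases: List[Tuple[float, bool]],
    fixed_cost: float,
    individual_cost: float,
    revenue_per_individual: float,
) -> List[Tuple[int, float, float]]:
    """Return (contacted, profit, probability_threshold) for each cutoff.

    scored_cases holds (predicted_probability, actually_responded) pairs.
    Assumption: cases are contacted from most to least probable.
    """
    ranked = sorted(scored_cases, key=lambda case: case[0], reverse=True)
    curve = []
    responders = 0
    for contacted, (probability, responded) in enumerate(ranked, start=1):
        responders += int(responded)
        profit = (responders * revenue_per_individual
                  - contacted * individual_cost
                  - fixed_cost)
        # The probability of the last contacted case acts as the threshold.
        curve.append((contacted, profit, probability))
    return curve


def best_cutoff(curve: List[Tuple[int, float, float]]) -> Tuple[int, float, float]:
    """Pick the point with maximum profit; ties go to the fewest contacts."""
    return max(curve, key=lambda point: (point[1], -point[0]))


# Hypothetical scored sample: (predicted probability, responded?)
cases = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
curve = profit_curve(cases, fixed_cost=10.0, individual_cost=5.0,
                     revenue_per_individual=30.0)
contacted, max_profit, threshold = best_cutoff(curve)
print(contacted, max_profit, threshold)
```

With these sample values the maximum falls at the fourth contact, which shows why the chart reports a probability threshold alongside the maximum profit: contacting cases below that probability costs more than it returns.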
Demo 1: Using the Data Mining Excel Add-In

Understanding Data Mining Structure Improvements
  • Data Partitioning for Training and Testing
  • Mining Model Column Aliases
  • Data Mining Filters
  • Drill Through to Mining Structure Data
  • Cross-Validation of a Mining Model

Data Partitioning for Training and Testing
  • Specify as percentage or maximum number of cases
  • Smaller value is used if both parameters are specified
  • Data is divided randomly between training and testing
  • HoldoutSeed property enables consistent partitions across structures

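The partitioning rules on this slide can be sketched in a few lines of Python. This is an illustration of the stated behavior only, not how Analysis Services implements it: the function `split_holdout` and its parameter names are hypothetical, and the shuffle-based sampling is an assumption.

```python
import random
from typing import List, Optional, Sequence, Tuple


def split_holdout(
    cases: Sequence,
    holdout_percent: Optional[float] = None,
    holdout_max_cases: Optional[int] = None,
    holdout_seed: int = 0,
) -> Tuple[List, List]:
    """Randomly split cases into (training, testing).

    If both a percentage and a maximum case count are given,
    the smaller resulting test-set size is used.
    """
    limits = []
    if holdout_percent is not None:
        limits.append(int(len(cases) * holdout_percent / 100))
    if holdout_max_cases is not None:
        limits.append(holdout_max_cases)
    test_size = min(limits) if limits else 0

    # A fixed seed makes the split repeatable, which is the role the
    # slide attributes to the HoldoutSeed property.
    rng = random.Random(holdout_seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


training, testing = split_holdout(range(10_000), holdout_percent=30,
                                  holdout_max_cases=1000, holdout_seed=42)
print(len(training), len(testing))
```

Here 30 percent of 10,000 cases would be 3,000, but the 1,000-case cap is smaller, so 1,000 cases are held out for testing. Calling the function again with the same seed returns the same partitions.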
Data Partitioning with DMX
  • Create a structure with partitioning with the HOLDOUT keyword
  • Query the structure to review partitions

Mining Model Column Aliases
  • Assign a column alias to reuse a column in a structure
  • Column content can be clarified
  • Column can be more easily referenced in DMX
  • Continuous and discretized versions of the same column can be used in separate models in the same structure

Data Mining Filters
  • Specify a condition to apply to mining structure columns
  • Filter creates subsets of training and testing data for a model
  • Multiple conditions can be linked with AND/OR operators
  • Conditions for continuous values use >, >=, <, <= operators
  • Conditions for discrete values use =, !=, or IS NULL operators
  • Conditions on nested tables can use the EXISTS keyword and a subquery

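The way filter conditions combine can be pictured with plain predicates. The sketch below is written in Python rather than DMX and only mimics the semantics described on the slide. The helpers `all_of`, `any_of`, and `exists` are hypothetical names, and the sample cases are made up, echoing the Gender and Water Bottle example from the editor's notes.

```python
from typing import Any, Callable, Dict, List

Case = Dict[str, Any]
Condition = Callable[[Case], bool]


def all_of(*conditions: Condition) -> Condition:
    """Link conditions with AND."""
    return lambda case: all(condition(case) for condition in conditions)


def any_of(*conditions: Condition) -> Condition:
    """Link conditions with OR."""
    return lambda case: any(condition(case) for condition in conditions)


def exists(nested_table: str, condition: Condition) -> Condition:
    """True when at least one row of the nested table meets the condition."""
    return lambda case: any(condition(row) for row in case.get(nested_table, []))


# Hypothetical case table with a nested Products table.
cases: List[Case] = [
    {"Gender": "M", "Age": 34, "Products": [{"Model": "Water Bottle"}]},
    {"Gender": "M", "Age": 51, "Products": [{"Model": "Helmet"}]},
    {"Gender": "F", "Age": 29, "Products": [{"Model": "Water Bottle"}]},
    {"Gender": None, "Age": 45, "Products": []},
]

# Discrete condition AND a nested-table EXISTS condition.
model_filter = all_of(
    lambda case: case["Gender"] == "M",
    exists("Products", lambda row: row["Model"] == "Water Bottle"),
)
filtered = [case for case in cases if model_filter(case)]

# Continuous condition OR an IS NULL style check.
older_or_unknown = any_of(
    lambda case: case["Age"] >= 50,
    lambda case: case["Gender"] is None,
)
print(len(filtered), sum(older_or_unknown(case) for case in cases))
```

Only the cases that pass the filter would be used to train and test the filtered model, while the underlying structure keeps all cases.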
Data Mining Filters with DMX
  • Add a filtered mining model to a structure

Drill Through to Mining Structure Data
  • Add columns to the mining structure, but not to models
  • Eliminates unnecessary data from model and improves processing time
  • Supports drill through from mining model viewer or DMX for visibility into results

Cross-Validation of a Mining Model
  • Purpose
      • Validate the accuracy of a single model
      • Compare models within the same mining structure
  • Process
      • Split mining structure into partitions of equal size
      • Iteratively build models on all partitions excluding one partition such that all partitions are excluded once
      • Measure accuracy of each model using the excluded partition
      • Analyze results

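The cross-validation process described above can be sketched as follows. This is a generic k-fold loop in Python under simple assumptions, not the Analysis Services procedure: `make_folds` and `cross_validate` are hypothetical names, and the toy majority-class "model" exists only to make the example runnable.

```python
import random
from collections import Counter
from typing import Any, Callable, List, Sequence


def make_folds(cases: Sequence, fold_count: int, seed: int = 0) -> List[List]:
    """Shuffle the cases and deal them into fold_count near-equal partitions."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::fold_count] for i in range(fold_count)]


def cross_validate(
    cases: Sequence,
    fold_count: int,
    train: Callable[[List], Any],
    accuracy: Callable[[Any, List], float],
    seed: int = 0,
) -> List[float]:
    """Train on all folds but one, score on the excluded fold, repeat."""
    folds = make_folds(cases, fold_count, seed)
    scores = []
    for held_out_index, held_out in enumerate(folds):
        training = [case
                    for index, fold in enumerate(folds) if index != held_out_index
                    for case in fold]
        model = train(training)
        scores.append(accuracy(model, held_out))
    return scores


# Toy model: always predict the most common label seen in training.
def train_majority(training: List) -> str:
    return Counter(label for _, label in training).most_common(1)[0][0]


def majority_accuracy(model: str, held_out: List) -> float:
    return sum(label == model for _, label in held_out) / len(held_out)


cases = [(i, "yes") for i in range(15)] + [(i, "no") for i in range(15, 20)]
scores = cross_validate(cases, fold_count=4, train=train_majority,
                        accuracy=majority_accuracy)
print(scores, sum(scores) / len(scores))
```

Analyzing the results usually means looking at both the average score and how much the per-fold scores vary: a model whose accuracy swings widely between folds is sensitive to which cases it was trained on.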
Cross-Validation Parameters
  • Fold Count
      • Number of partitions to use
      • Minimum 2, maximum 256
      • Maximum 10 for session mining structure
  • Max Cases
      • Total number of cases to include in cross-validation
      • Cases divided across folds
      • Value of 0 specifies all cases
  • Target Attribute
      • Predictable column
  • Target State
      • Target value for target attribute
      • Value of null specifies all states are to be tested
  • Target Threshold
      • Value between 0 and 1 for prediction probability above which a predicted state is considered correct
      • Value of null specifies most probable prediction is considered correct
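The parameter limits on this slide can be expressed as a small validation routine. The sketch below is Python written for illustration: the class `CrossValidationParameters`, its field names, and the function `is_counted_correct` are hypothetical and do not correspond to an Analysis Services API. Treating the threshold range as inclusive of 0 and 1 is an assumption, and the threshold check is one plausible reading of the slide's wording.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrossValidationParameters:
    target_attribute: str
    fold_count: int
    max_cases: int = 0                        # 0 means use all cases
    target_state: Optional[str] = None        # None means test all states
    target_threshold: Optional[float] = None  # None means most probable wins
    session_structure: bool = False

    def validate(self) -> None:
        """Raise ValueError if a value falls outside the limits on the slide."""
        max_folds = 10 if self.session_structure else 256
        if not 2 <= self.fold_count <= max_folds:
            raise ValueError(f"fold_count must be between 2 and {max_folds}")
        if self.max_cases < 0:
            raise ValueError("max_cases must be 0 (all cases) or positive")
        if not self.target_attribute:
            raise ValueError("target_attribute must name a predictable column")
        threshold = self.target_threshold
        if threshold is not None and not 0 <= threshold <= 1:
            raise ValueError("target_threshold must be between 0 and 1")


def is_counted_correct(
    predicted_state: str,
    probability: float,
    actual_state: str,
    target_threshold: Optional[float],
) -> bool:
    """Decide whether a prediction counts as correct.

    Assumption: with no threshold, the caller passes the most probable
    prediction and only the state has to match; with a threshold, the
    prediction probability must also exceed it.
    """
    if predicted_state != actual_state:
        return False
    if target_threshold is None:
        return True
    return probability > target_threshold


params = CrossValidationParameters(target_attribute="Bike Buyer", fold_count=10,
                                   target_state="Yes", target_threshold=0.7)
params.validate()
print(is_counted_correct("Yes", 0.6, "Yes", params.target_threshold))
```

Raising the threshold makes the accuracy measure stricter, because predictions that match the actual state but carry a low probability no longer count.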

Demo 2: Creating a Clustering Model

Using the New Time Series Algorithm
  • Better Time Series Support
  • Time Series Algorithm Parameters

Better Time Series Support
  • ARTxp algorithm
      • Still included in Microsoft Time Series algorithm
      • Best for prediction of next likely value in a series
  • ARIMA algorithm
      • Added to Microsoft Time Series algorithm
      • Best for long-term predictions
  • The new Microsoft Time Series algorithm
      • Trains one model using ARTxp and a second model using ARIMA
      • Blends the results to return best prediction

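The idea of blending a short-term and a long-term forecast can be illustrated with a weighted average. The slide does not say how Microsoft Time Series weights the two models, so the sketch below simply assumes the weight moves linearly from the short-term forecast to the long-term forecast across the prediction horizon. The function `blend_forecasts` and the sample values are hypothetical.

```python
from typing import List, Sequence


def blend_forecasts(short_term: Sequence[float],
                    long_term: Sequence[float]) -> List[float]:
    """Blend two forecasts of equal length, step by step.

    Assumption: the first step relies entirely on the short-term forecast
    and the last step entirely on the long-term forecast, with a linear
    shift in between.
    """
    if len(short_term) != len(long_term):
        raise ValueError("forecasts must cover the same number of steps")
    steps = len(short_term)
    blended = []
    for step, (near, far) in enumerate(zip(short_term, long_term)):
        long_weight = step / (steps - 1) if steps > 1 else 0.5
        blended.append((1 - long_weight) * near + long_weight * far)
    return blended


# Hypothetical three-step forecasts from the two models.
artxp_like = [100.0, 104.0, 110.0]
arima_like = [98.0, 100.0, 102.0]
print(blend_forecasts(artxp_like, arima_like))
```

The blended series follows the short-term model for the next value and the long-term model at the far end of the horizon, which matches the strengths the slide attributes to ARTxp and ARIMA.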
Time Series Algorithm Parameters

Resources
  • Model Filter Syntax and Examples: technet.microsoft.com/en-us/library/bb895186(SQL.100).aspx
  • Cross-Validation: msdn2.microsoft.com/en-us/library/bb895174(SQL.100).aspx
  • SQL Server Data Mining: www.sqlserverdatamining.com
  • Jamie MacLennan’s blog: blogs.msdn.com/jamiemac/default.aspx

© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Editor's Notes

  • #5: Data Mining Office Add-ins were introduced with SQL Server 2005, and a new version is available for SQL Server 2008 to take advantage of the improvements made to Analysis Services data mining. In this module, we’ll review how to use the Data Mining Add-ins, and then examine the changes made to mining structures as well as the new Time Series algorithm.
  • #6: Data Mining Add-ins for Office allow you to perform a variety of data mining tasks. You can prepare data by applying data cleansing, and you can partition the data into training and test sets. Some of the add-in tools are focused on exploring your data, while other tools are built specifically for prediction purposes. The add-ins also include functionality for testing and validating each model. Point out that the add-ins are also useful as a client viewer for data mining models developed on the server.
  • #7: This slide shows the data preparation tasks: Explore Data (to find anomalies), Clean Data (to handle outliers or erroneous data), and Partition Data (to separate it into training and test data). In the background is a view used to consolidate information from several tables. Transformations have been applied to enforce business rules. This logical table is then used as the source for data mining activities, whether using the add-ins or using BI Development Studio.
  • #8: This slide identifies the table analysis tools that are exploration-based data mining tools and identifies the data mining algorithm associated with the tool.
  • #9: This slide identifies the data modeling tools that are exploration-based data mining tools and identifies the data mining algorithm associated with the tool.
  • #10: Model viewers are available not only for mining data models created by using the add-in, but also for mining models created on the server.
  • #11: This slide shows the predictive tools and shows the related algorithm.
  • #12: Prediction tools are also available in the Data Modeling ribbon of Excel. Here you see the algorithm associated with these predictive tools.
  • #13: The Data Mining add-in also includes model testing and validation tools, such as an Accuracy cart, a classification matrix, and a profit chart. Cross Validation is also new to Analysis Services data mining and will be discussed in more detail later in this module.
  • #15: In this section, we’ll review the improvements for mining structures in SSAS 2008. Specifically, we’ll look at setting up data partitions for training and testing dta, how to us aliases with mining model columns, how to apply filers to data associated with a mining model, how to drillthrough to details when studying data mining results, and how to use the cross-validation report to assess the accuracy of a model or to compare multiple models to find the best model.
  • #16: To create training and testing sets using random data for SSAS 2005, best practice was to use the Random Sample transformation in SSIS 2005. However, the package design was particularly cumbersome for structures with nested tables. In SSAS 2008, the process to generate random data sets for training and testing is built in.You can specify parameters for partitioning data into training and testing sets: In the Data Mining Wizard In the Properties pane of the mining structureAnalysis services uses a random sampling algorithm to assign data to either the training or the testing data set.If you provide both a percentage and maximum number of rows, the smaller number prevails. For example, you can specify a percentage of 30% of the entire data set which is not to exceed 1,000 rows if the data source continues to grow. When using the same data source view for multiple mining structures, you might want to keep the same partitioning strategy for each mining structure. Set the HoldoutSeed property to the same value in each structure to yield comparable results in the training and testing data sets.You can also define partitioning using DMX, AMO, or XML DDL.Point out that partitioning is not available for a model using the Time Series algorithm.
  • #17: For those who prefer to use DMX to create mining structures instead of the user interface, DMX now supports partitioning when the mining structure is created. Point out that HOLDOUT cannot be used with ALTER MINING STRUCTURE. The process to train the model, using INSERT INTO MINING STRUCTURE, is unchanged. The query executes and the data is randomly sampled. A holdout store is created for each partition of the mining structure. In SSAS 2008, you can now query the structure to view the contents of the training and testing data sets.
  • #18: In SSAS 2005, you could change the name of a mining model column in Business Intelligence Development Studio, but not in DMX. One reason you might want to alias a column is when you want to use the same column with different algorithms, but one algorithm supports continuous columns and the other does not. You can add a column to the mining structure more than once and set the Content property to a different value for each version of the column. Ignore the column in the model where the content type is unsupported, and include it as an input column in models supporting that content type. By enabling the use of an alias, you can use the same NATURAL PREDICTION JOIN for the models in the same mining structure because input columns are bound by name to the model column.
  • #19: Instead of creating separate data source views for your mining structure, you can create separate filtered models. Each model contains the same training and testing data, which allows you to compare model results. Why create filtered models? (1) Achieve better overall accuracy by eliminating strong patterns of one attribute value (e.g. North America versus Pacific). (2) Compare patterns in isolated subsets of data. You can create filters in the Model Filter dialog box or in the Properties pane of the mining model. In the case of discretized values, the bucket containing the specified value is selected. Example: Age = 23 returns the bucket containing ages 20-25. An example of a filter expression for a case table and a nested table: Gender = 'M' AND EXISTS(SELECT * FROM Products WHERE Model = 'Water Bottle'). Point out that NOT EXISTS is also valid. Mention the URL on the Resources slide for more information about filter syntax. You must process the mining structure to see the filter applied to the model.
  • #20: Mention that using drillthrough in a filtered model returns all cases matching the filter, whether used for training or testing.
  • #21: As in SSAS 2005, the following algorithms do not support drill through: Naïve Bayes, Neural Network, and Logistic Regression. The Time Series algorithm supports drill through in a DMX query only; drill through is not supported in Business Intelligence Development Studio.
  • #22: Using parameters you specify, cross-validation automatically creates partitions of the data set of approximately equal size. For each partition, a mining model is built on the entire data set with that partition removed, and then tested for accuracy against the partition that was excluded. If the results vary only slightly across partitions, the model generalizes well. If there is too much variation, the model is not useful. Point out that cross-validation cannot be used with models built using the Time Series or Sequence Clustering algorithms. You can use the Cross Validation Report in the Mining Accuracy Chart of Business Intelligence Development Studio, or use Analysis Services stored procedures to create an ad hoc cross-validation report in SQL Server Management Studio.
  • #23: More folds result in longer processing time.
  • #24: This slide and the next outline the types of tests, and their respective measures, that are found on the cross-validation report. Different models will use different test types for this report. Point out that the report can be generated in Business Intelligence Development Studio, which will be shown in the demonstration, or by calling an Analysis Services stored procedure.
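The fold-by-fold procedure in note #22 can be sketched as follows. This is a generic k-fold illustration in Python, not the Analysis Services stored procedure: `make_folds`, `train_majority`, and `cross_validate` are hypothetical names, the "model" is a toy that always predicts the most common label, and the sketch assumes the fold count does not exceed the number of cases.

```python
from collections import Counter
from statistics import mean, pstdev


def make_folds(cases, fold_count):
    """Deal cases round-robin into fold_count partitions of near-equal size."""
    return [cases[i::fold_count] for i in range(fold_count)]


def train_majority(training_cases):
    """Toy stand-in for a mining model: predict the most common training label."""
    labels = Counter(label for _, label in training_cases)
    return labels.most_common(1)[0][0]


def cross_validate(cases, fold_count):
    """Return one accuracy per fold: train without the fold, test on the fold."""
    folds = make_folds(cases, fold_count)
    accuracies = []
    for i, held_out in enumerate(folds):
        training = [c for j, fold in enumerate(folds) if j != i for c in fold]
        prediction = train_majority(training)
        hits = sum(1 for _, label in held_out if label == prediction)
        accuracies.append(hits / len(held_out))
    return accuracies


if __name__ == "__main__":
    # Hypothetical cases: (case id, label); every fourth case is labeled "no".
    cases = [(i, "yes" if i % 4 else "no") for i in range(20)]
    scores = cross_validate(cases, fold_count=5)
    # A small spread across folds suggests the model generalizes well.
    print(scores, mean(scores), pstdev(scores))
```

Each additional fold means one more model to train, which is why a higher fold count lengthens processing time.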
  • #27: Data mining in SSAS 2008 was also improved by modifying the Time Series algorithm. In this section, we’ll review why the mining structure is improved and we’ll review the algorithm parameters for the Time Series algorithm.
  • #28: In SSAS 2005, the ARTxp Time Series prediction algorithm (autoregressive tree model for multiple prior unknown states), built by Microsoft Research, was introduced. The purpose of this algorithm was to tackle a difficult business problem: how to accurately predict the next step in a series. It was less reliable for predicting 10 steps or further out. ARIMA (autoregressive integrated moving average) is a very common time series algorithm that is well understood by seasoned data miners. It provides good predictions when projecting beyond the next 10 steps. In SSAS 2008, the Microsoft Time Series algorithm blends the results of the two algorithms to leverage their short-term and long-term capabilities. In Standard Edition, you can configure your model to use one or the other algorithm, or both (which is the default). In Enterprise Edition, you can apply custom weighting to get the best prediction over a variable time span.
  • #29: The FORECAST_METHOD default value is MIXED. You can change this to ARIMA or ARTXP to use a single algorithm exclusively. The PREDICTION_SMOOTHING parameter affects the weighting of the ARTxp and ARIMA algorithms when MIXED mode is used. A value closer to 0 weights in favor of ARTxp, while a value closer to 1 weights in favor of ARIMA. For example, a value of 0.8 weights the blend toward ARIMA, while a value of 0.2 weights it toward ARTxp.
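To make the direction of the PREDICTION_SMOOTHING weighting in note #29 concrete, here is a minimal Python sketch. It assumes a fixed linear mix of the two forecasts, which is a simplification chosen for illustration; the notes do not describe the exact formula that Analysis Services uses, and `blend_forecasts` is a hypothetical name.

```python
def blend_forecasts(artxp, arima, prediction_smoothing=0.5):
    """Mix two forecast series step by step.

    Assumption for this sketch: a plain linear blend, where 0 means
    ARTxp only and 1 means ARIMA only. The real engine may weight
    the two algorithms differently.
    """
    if not 0.0 <= prediction_smoothing <= 1.0:
        raise ValueError("prediction_smoothing must be between 0 and 1")
    if len(artxp) != len(arima):
        raise ValueError("both forecasts must cover the same steps")
    w = prediction_smoothing
    return [(1 - w) * a + w * b for a, b in zip(artxp, arima)]


if __name__ == "__main__":
    artxp = [100.0, 102.0, 104.0]  # hypothetical ARTxp predictions
    arima = [200.0, 198.0, 196.0]  # hypothetical ARIMA predictions
    # 0.8 pulls every step most of the way toward the ARIMA values.
    print(blend_forecasts(artxp, arima, prediction_smoothing=0.8))
    # 0.2 keeps every step close to the ARTxp values.
    print(blend_forecasts(artxp, arima, prediction_smoothing=0.2))
```

Under this assumption, setting the parameter to 0 or 1 reproduces one forecast exactly, which mirrors choosing ARTXP or ARIMA as the FORECAST_METHOD.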