Big Data with SQL Server


Philly SQL Server User Group
November 2012



Mark Kromer
Razorfish BI & Big Data Technology Director
http://www.kromerbigdata.com
@kromerbigdata
@mssqldude
What we’ll (try to) cover tonight

‣ What is Big Data?
‣ The Big Data and Apache Hadoop environment
‣ Big Data Analytics
‣ SQL Server in the Big Data world
‣ How we utilize Big Data @ Razorfish




Big Data 101

‣ 3 V’s
   ‣ Volume – Terabytes of records, transactions, tables, files
   ‣ Velocity – Batch, near-real-time, real-time (analytics), streams
   ‣ Variety – Structured, unstructured, semi-structured, and all of the above in a mix
‣ Text Processing
   ‣ Techniques for processing and analyzing unstructured (and structured) LARGE files
‣ Analytics & Insights
‣ Distributed File System & Programming
Mark’s Big Data Myths

‣ Big Data ≠ NoSQL
    ‣ NoSQL shares the Internet-scale Web origins of the Hadoop stack (Yahoo!,
      Google, Facebook, et al.) but is not the same thing
    ‣ Facebook, for example, uses HBase from the Hadoop stack
‣ Big Data ≠ Real Time
    ‣ Big Data is primarily about batch processing huge files in a distributed manner
      and analyzing data that was otherwise too complex to provide value
    ‣ Use in-memory analytics for real-time insights
‣ Big Data ≠ Data Warehouse
    ‣ I still refer to large multi-TB DWs as “VLDB”
    ‣ Big Data is about crunching stats in text files for discovery of new patterns and
      insights
    ‣ Use the DW to aggregate and store the summaries of those calculations for
      reporting
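The split described above (crunch the raw text files, keep only the summaries in the DW) can be sketched in a few lines of Python. This is only an illustration: the log format, the client names, and the `summarize` function are hypothetical and do not come from the talk.

```python
from collections import Counter

# Hypothetical raw web-log lines: "<ISO date> <client id> <page path>"
RAW_LOG = """\
2012-11-01 clientA /home
2012-11-01 clientA /products
2012-11-01 clientB /home
2012-11-02 clientA /home
2012-11-02 clientA /home
"""

def summarize(lines):
    """Count page views per (date, client) and return sorted summary rows."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed lines rather than fail the whole batch
        day, client, _page = parts
        counts[(day, client)] += 1
    return sorted((day, client, n) for (day, client), n in counts.items())

# Only these small summary rows would be loaded into the warehouse for reporting.
summary_rows = summarize(RAW_LOG.splitlines())
for row in summary_rows:
    print(row)
```

The raw lines stay in the file system; the warehouse only ever sees one row per day and client.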
‣   Batch Processing
‣   Commodity Hardware
‣   Data Locality, no shared storage
‣   Scales linearly
‣   Great for large text file processing, not so great on small files
‣   Distributed programming paradigm
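The distributed programming paradigm in the last bullet is typically MapReduce. The sketch below imitates its three phases (map, shuffle, reduce) in a single Python process, assuming a simple word count as the job. The input splits and function names are made up for illustration; on a real cluster each split would be processed on the node that stores it.

```python
from collections import defaultdict
from itertools import chain

def map_phase(split):
    """Mapper: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Two hypothetical input splits; each is mapped independently of the other.
splits = [
    ["big data is batch", "batch is big"],
    ["data locality matters"],
]

mapped = chain.from_iterable(map_phase(s) for s in splits)
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

Because every mapper works only on its own split, adding nodes adds capacity, which is the intuition behind the "scales linearly" bullet.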
Big Data Analytics Web Platform
In-Database Analytics (Teradata Aster)
•   Because of built-in analytics functions and big data performance, Aster becomes
    the data scientist’s sandbox and BI’s big data analytics processor.




Prepackaged Analytics Functions (including Attribution)
SQL Server Big Data – Data Loading




Diagram labels: Amazon HDFS & EMR; Data Loading; Amazon S3 Bucket
SQL Server Big Data Environment

‣ SQL Server Database
   ‣   SQL Server 2008 R2 or 2012 Enterprise Edition
   ‣   Page Compression
   ‣   Columnstore indexes on fact tables (SQL Server 2012)
   ‣   Clustered Index on all tables
   ‣   Auto Update Statistics Asynchronously enabled
   ‣   Partition Fact Tables by month and archive data with sliding window technique
   ‣   Drop all indexes before nightly ETL load jobs
   ‣   Rebuild all indexes when ETL completes
‣ SQL Server Analysis Services
   ‣   SSAS 2008 R2 or 2012 Enterprise Edition
   ‣   2008 R2 OLAP cubes partition-aligned with DW
   ‣   2012 cubes as in-memory tabular models
   ‣   All access through MSMDPUMP or SharePoint
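The monthly sliding-window technique mentioned above comes down to date arithmetic: each month one new boundary is added at the front of the window and the oldest one is retired. The sketch below plans a single slide and prints the corresponding T-SQL. It assumes a RANGE RIGHT partition function whose boundaries fall on the first day of each month; the object names `pfMonthly` and `psMonthly`, the filegroup, and the 12-month window are hypothetical, not taken from the talk.

```python
from datetime import date

def add_months(d, n):
    """Return the first day of the month n months after d's month."""
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)

def sliding_window_plan(newest_boundary, months_kept):
    """Plan one monthly slide for a window holding `months_kept` months.

    newest_boundary is the first day of the newest month already partitioned.
    Returns (boundary_to_add, boundary_to_retire).
    """
    to_add = add_months(newest_boundary, 1)
    to_retire = add_months(to_add, -months_kept)
    return to_add, to_retire

def to_tsql(to_add, to_retire, pf="pfMonthly", ps="psMonthly", fg="PRIMARY"):
    """Render the slide as T-SQL text (object names are placeholders)."""
    return [
        f"ALTER PARTITION SCHEME {ps} NEXT USED [{fg}];",
        f"ALTER PARTITION FUNCTION {pf}() SPLIT RANGE ('{to_add.isoformat()}');",
        "-- switch the oldest partition out to an archive table here, then:",
        f"ALTER PARTITION FUNCTION {pf}() MERGE RANGE ('{to_retire.isoformat()}');",
    ]

to_add, to_retire = sliding_window_plan(date(2012, 11, 1), months_kept=12)
for stmt in to_tsql(to_add, to_retire):
    print(stmt)
```

Splitting an empty partition at the front and merging an emptied one at the back keeps both operations metadata-only, which is why the old data is switched out before the merge.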
Wrap-up

‣ What is a Big Data approach to Analytics?
   ‣ Massive scale
   ‣ Data discovery & research
   ‣ Self-service
   ‣ Reporting & BI
‣ Why did we take this Big Data Analytics approach?
   ‣ Each Web client produces an average of 6 TB of ICA data per year
   ‣ The data in the sources are variable and unstructured
   ‣ SSIS ETL alone couldn’t keep up with the volume or handle the complexity
   ‣ SQL Server 2012 columnstore and SSAS 2012 tabular were key to using SQL Server for Big Data
   ‣ With the configurations described earlier, SQL Server is working great
‣ Analytics on Big Data also requires Big Data Analytics tools
   ‣ Aster, Tableau, PowerPivot, SAS
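To put the 6 TB per client per year figure in perspective, it can be converted to an average daily load. The calculation below assumes decimal units (1 TB = 1000 GB) and an even spread across 365 days, neither of which is stated in the talk.

```python
TB_PER_CLIENT_PER_YEAR = 6   # figure from the wrap-up slide
GB_PER_TB = 1000             # assumption: decimal units
DAYS_PER_YEAR = 365          # assumption: load is spread evenly over the year

gb_per_day = TB_PER_CLIENT_PER_YEAR * GB_PER_TB / DAYS_PER_YEAR
print(f"{gb_per_day:.1f} GB per client per day")
```

That works out to roughly 16 GB of new data per client per day on average, before any peaks.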
