On the move with Big Data
Hadoop, Pig, Sqoop, SSIS…

Stéphane Fréchette
Thursday February 13, 2014
Who am I?
My name is Stéphane Fréchette

SQL Server MVP - I’m a Database & Business Intelligence Professional and Founder | CEO
of
I have a passion for architecting, designing and building solutions that matter.
A self-proclaimed Open Data Hacker/Advocate, I founded Gatineau Ouverte, a citizen-led
initiative that aims to promote open access to the civic data of the city of Gatineau.

Twitter: @sfrechette
Blog: stephanefrechette.com
Email: stephanefrechette@ukubu.com
Session Outline
• What is Big Data?
• Apache Hadoop
• Hadoop Ecosystem
• Windows Azure HDInsight
• On the move…
• SSIS, Sqoop, Pig

• Demos
• Resources
What is Big Data?

Apache Hadoop
• Open-source software framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming
models
• Designed to scale up from single servers to thousands of machines, each
offering local computation and storage
Hadoop Ecosystem
• Core components:
• HDFS (Hadoop Distributed File System) -> Storage
• MapReduce -> Processing
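
To make the MapReduce processing model concrete, here is a minimal single-process sketch of a word count in plain Python. It does not use Hadoop at all; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and the in-memory `shuffle` only stands in for the grouping step that the framework would perform across the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (Hadoop does this between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data on the move", "move big data"]
    print(word_count(sample))
```

On a real cluster the mappers and reducers would run in parallel on the nodes that hold the HDFS blocks; the sketch only shows the shape of the computation.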
What is Pig?
• Write complex MapReduce jobs using a simple scripting language (Pig Latin)

• A platform for analyzing large data sets that consists of a high-level language
for expressing data analysis programs
• Pig translates and compiles Pig Latin scripts into MapReduce jobs on the fly

http://pig.apache.org
What is Sqoop?
• Command-line interface application to transfer bulk data between Hadoop
and relational datastores

http://sqoop.apache.org
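
As a rough illustration of what such a bulk transfer looks like, the sketch below assembles a `sqoop import` command line for pulling one SQL Server table into HDFS. It only builds and prints the command; it does not run it. The helper name `build_sqoop_import` and every value passed to it (host, database, table, user, paths) are hypothetical placeholders, and port 1433 is assumed as the SQL Server default. The options shown (`--connect`, `--username`, `--password-file`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop import arguments, and the connection string assumes the Microsoft JDBC driver is available to Sqoop.

```python
import shlex


def build_sqoop_import(host, database, table, username,
                       password_file, target_dir, num_mappers=1):
    """Return the argument list for a Sqoop import from SQL Server into HDFS."""
    # Assumption: SQL Server listens on its default port, 1433.
    jdbc_url = f"jdbc:sqlserver://{host}:1433;databaseName={database}"
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,  # file holding the password
        "--table", table,                  # source table in the relational database
        "--target-dir", target_dir,        # destination directory in HDFS
        "--num-mappers", str(num_mappers), # number of parallel map tasks
    ]


if __name__ == "__main__":
    # All values below are placeholders, not taken from the session demos.
    args = build_sqoop_import(
        host="sqlhost.example.com",
        database="SalesDb",
        table="Orders",
        username="etl_user",
        password_file="/user/etl/.sqlpw",
        target_dir="/data/orders",
    )
    print(shlex.join(args))
```

Moving data the other way (from HDFS back into a relational table) uses `sqoop export` with a similar set of connection arguments.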
What is Hive?
• A data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis
• Provides an SQL-like language called HiveQL to query data
• Integration between Hadoop and BI and visualization tools

http://hive.apache.org
What is SSIS?
• SQL Server Integration Services is a platform for data integration and
workflow applications. A fast and flexible tool used for data extraction,
transformation, and loading (ETL).
• Contains a rich set of built-in tasks and transformations, plus tools for constructing
packages…
• Used to solve complex business problems
Windows Azure HDInsight
• HDInsight is a Hadoop-based service from Microsoft that brings a 100
percent Apache Hadoop solution to the cloud
• Based on the Hortonworks Data Platform
• Scalable, on-demand service
Demos
(let’s move some data…)
Resources
• Apache Projects (list with links) http://bit.ly/MfpLtE
• Windows Azure HDInsight http://bit.ly/1dnlAX1
• HDInsight Tutorials and Guide http://bit.ly/LWRYol
• Hortonworks Sandbox 2.0 http://bit.ly/1gkkCte
• Hortonworks Tutorial Gallery http://bit.ly/1nvMAEX
• Microsoft JDBC Driver 4.0 for SQL Server http://bit.ly/1kEgJ7O
• Microsoft Hive ODBC Driver http://bit.ly/NFkhcH
• GitHub: WindowsAzure / azure-content http://bit.ly/1hfthlF
• SSIS Custom Task – Disorderly Data (Ken Ross) http://bit.ly/1nvIH2G
• GitHub https://github.com/kzhen/SSISHDFS
What Questions Do You Have?
Thank You
For attending this session
