Enabling Next Gen Analytics with
Azure Data Lake
Microsoft Azure
Microsoft Cloud
Global Trusted Hybrid
Big Data Definition
Big data is high-volume, high-velocity
and/or high-variety information assets
that demand cost-effective, innovative
forms of information processing that
enable enhanced insight, decision
making, and process automation.
– Gartner, Big Data Definition*
* Gartner, Big Data (Stamford, CT.: Gartner, 2016), URL: http://www.gartner.com/it-glossary/big-data/
Big Data as a Cornerstone of
Cortana Intelligence
Data Sources: Apps; Sensors and devices; Data
Information Management: Event Hubs; Data Catalog; Data Factory
Big Data Stores: SQL Data Warehouse; Data Lake Store
Machine Learning and Analytics: HDInsight (Hadoop and Spark); Stream Analytics; Data Lake Analytics; Machine Learning
Intelligence: Cortana; Bot Framework; Cognitive Services
Dashboards & Visualizations: Power BI
Action: People; Automated Systems (Apps, Web, Mobile, Bots)
However, there are challenges to Big Data…
Obtaining skills
and capabilities
Determining how
to get value
Integrating with
existing IT
investments
*Gartner: Survey Analysis – Hadoop Adoption Drivers and Challenges (Stamford, CT.: Gartner, 2015)
Azure
HDInsight
A Cloud Spark and
Hadoop service for the
Enterprise
Reliable with an industry leading SLA
Enterprise-grade security and monitoring
Productive platform for developers and
scientists
Cost effective cloud scale
Integration with leading ISV applications
Easy for administrators to manage
63% lower TCO than deploying your own
Hadoop on-premises*
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
HDInsight Application Platform
• One-click deploy experience for installing apps.
• Fully managed PaaS offering.
• Access to entire cluster and secure by default.
• Install apps on new or existing clusters.
• Ease of authoring and deployment.
• Certified partners only.
Hybrid cloud, a reality today
74%
Enterprises believe a
hybrid cloud will enable
business growth¹
82%
Enterprises have a hybrid
cloud strategy, up from 74
percent a year ago²
Workload
requirements
Regulation
Sensitive data
Customization
Latency
Legacy support
Introduction to StreamSets
for Microsoft Azure
Who is StreamSets?
Enterprise Data DNA
StreamSets Mission
Empower enterprises to harness their data in motion.
Top-tier Investors
Commercial Customers Across Verticals
150,000 downloads
⅓ of the Fortune 100
Products
StreamSets Dataflow Performance Manager™ (DPM)
StreamSets Data Collector™ (open source)
Strong Partner Ecosystem
Open Source Success
StreamSets Solution
Desired Business Outcomes
● Developer & operator
efficiency
● On-time delivery
● Data trust & governance
Data in motion middleware that ensures data trust.
StreamSets Products
StreamSets Data Collector (SDC)
Open source tooling and engine to build complex any-to-any dataflows.
StreamSets Dataflow Performance Manager (DPM)
Cloud service to map, measure and master dataflow operations.
DATAFLOW LIFECYCLE
DEVELOP
● Developers
● Scientists
● Architects
OPERATE
● Operators
● Stewards
● Architects
EVOLVE (Proactive)
REMEDIATE (Reactive)
StreamSets Deployment Models
Install on
Local Machine
Install on
Azure VM
StreamSets and Microsoft Azure
in Use in a Major Bank
The Customer
● Forbes Global 500 financial services company.
● Adopting and moving into the cloud at a rapid pace.
● Growing rapidly both via acquisitions and organic growth.
Key Challenges Related to
Data Movement
● A number of legacy tools, both customer-built and vendor-built.
● Security policy changes very hard to manage.
● Lack of security governance due to fragmentation of tools and lack of
standardization.
● Difficulty onboarding new data sources as soon as they are created
(technology change).
● Data drift (unexpected changes) very hard to manage at scale.
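To make the data drift challenge concrete, here is a minimal sketch of how unexpected changes in incoming records could be flagged. It is an illustration only and not how StreamSets handles drift; the schema, the field names and the detect_drift helper are all hypothetical.

```python
from typing import Any, Dict, List


def detect_drift(expected: Dict[str, type], record: Dict[str, Any]) -> List[str]:
    """Compare one record against an expected schema and list the differences."""
    findings = []
    # Fields the schema expects but the record no longer carries.
    for name in sorted(expected.keys() - record.keys()):
        findings.append(f"missing field: {name}")
    # Fields that appeared upstream without notice.
    for name in sorted(record.keys() - expected.keys()):
        findings.append(f"unexpected field: {name}")
    # Fields that are still present but changed type.
    for name in sorted(expected.keys() & record.keys()):
        if not isinstance(record[name], expected[name]):
            findings.append(
                f"type change: {name} expected {expected[name].__name__}, "
                f"got {type(record[name]).__name__}"
            )
    return findings


# Hypothetical schema and drifted record, for illustration only.
EXPECTED_SCHEMA = {"account_id": str, "amount": float, "currency": str}
drifted = {"account_id": "A-1", "amount": "12.50", "ccy": "USD"}

for finding in detect_drift(EXPECTED_SCHEMA, drifted):
    print(finding)
```

A renamed field shows up as one missing and one unexpected field, which is why drift of this kind is hard to spot when many sources change independently.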
Key Factors for the Customer to
Consider StreamSets
● KPIs
● Delivery guarantees
● Multiple types of origins and destinations using a single tool.
● Works natively with Microsoft Azure as part of HDInsight or on an Azure
Virtual Machine, or deployed on premises.
● Visualization of actual data transfers.
● Define security boundaries, actors, etc.
● Repeating pattern
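The "delivery guarantees" factor can be illustrated with a small sketch of at-least-once delivery: the sender retries until a write is acknowledged, and the destination is keyed by record id so a redelivered record does not create a duplicate. This is a generic pattern under stated assumptions, not the StreamSets implementation; AckLosingDestination and deliver_at_least_once are hypothetical names.

```python
from typing import Dict, Iterable


class AckLosingDestination:
    """Hypothetical destination: stores the record, then loses the first acknowledgement."""

    def __init__(self) -> None:
        self.stored: Dict[str, dict] = {}
        self._failed_once = set()

    def write(self, record: dict) -> None:
        # Keyed by id, so a redelivered record overwrites instead of duplicating.
        self.stored[record["id"]] = record
        if record["id"] not in self._failed_once:
            self._failed_once.add(record["id"])
            raise ConnectionError("simulated lost acknowledgement")


def deliver_at_least_once(
    records: Iterable[dict], destination: AckLosingDestination, max_attempts: int = 3
) -> int:
    """Retry each write until it is acknowledged; return the total number of attempts."""
    attempts = 0
    for record in records:
        for attempt in range(1, max_attempts + 1):
            attempts += 1
            try:
                destination.write(record)
                break  # acknowledged, move on to the next record
            except ConnectionError:
                if attempt == max_attempts:
                    raise  # give up rather than silently drop the record


    return attempts


dest = AckLosingDestination()
batch = [{"id": "tx-1", "amount": 10.0}, {"id": "tx-2", "amount": 25.5}]
total_attempts = deliver_at_least_once(batch, dest)
print(total_attempts, len(dest.stored))
```

Each record is written twice because its first acknowledgement is lost, yet the destination ends up with exactly one copy per id.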
Customer’s Business Objectives
● Short-lived compute and long-lived storage (ADLS, Azure Blob), which in
turn gives fine-grained cost control.
● Ability to build a microanalytics framework. For instance, instead of
taking the entire dataset, build micro datasets, run the microanalytics
framework on them and derive results faster (faster iteration).
● Move away from a traditional data lake to Azure Data Lake to manage
cost and scale.
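One way to read the micro dataset objective is as reproducible sampling: iterate on a small slice and only go back to the full data when needed. The sketch below assumes that reading; the synthetic data and the micro_dataset helper are hypothetical and stand in for whatever the customer actually built.

```python
import random
import statistics
from typing import List, Sequence


def micro_dataset(rows: Sequence[float], size: int, seed: int = 0) -> List[float]:
    """Draw a reproducible random sample to iterate on instead of the full dataset."""
    rng = random.Random(seed)
    return rng.sample(list(rows), min(size, len(rows)))


# Hypothetical "full" dataset: 100,000 synthetic transaction amounts.
generator = random.Random(42)
full = [generator.uniform(1.0, 500.0) for _ in range(100_000)]

sample = micro_dataset(full, size=1_000)
print(f"full mean   = {statistics.fmean(full):.2f}")
print(f"sample mean = {statistics.fmean(sample):.2f}")
```

Fixing the seed keeps the micro dataset stable between runs, so results from successive iterations stay comparable.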
Use Cases for StreamSets
Use Cases
1. Data movement from on-premises to
Azure Data Lake.
2. Consolidating migration tools into
a single tool.
3. Building DR for HDInsight Kafka
workloads.
Resources / Q & A
StreamSets Data Collector @ Azure Marketplace
https://azure.microsoft.com/en-us/marketplace/partners/streamsets/streamsets-data-collector/
Ingest Data into Microsoft Azure Data Lake (YouTube)
https://www.youtube.com/watch?v=c1dVnOK7Luw
StreamSets Community
https://streamsets.com/community/
StreamSets Dataflow Performance Manager Product Information
https://streamsets.com/products/dpm/
Thanks!
