AzureDay North Poland
Gdynia 2016
Introduction to Big Data Analytics?
Łukasz Grala | Senior Architect
Łukasz Grala
• Senior solutions architect for Data Platform, Business Intelligence, and Advanced Analytics at TIDK
• Creator of "Data Scientist as a Service"
• Microsoft Certified Trainer and university lecturer
• Author of advanced training courses and workshops, and of numerous publications and webcasts
• Recipient of the Microsoft Data Platform MVP award since 2010
• PhD student at Politechnika Poznańska (Poznań University of Technology), Faculty of Computing (research areas: databases, data mining, machine learning)
• Speaker at numerous conferences in Poland and abroad
• Holds numerous certifications (MCT, MCSE, MCSA, MCITP,…)
• Member of Polskie Towarzystwo Informatyczne (Polish Information Processing Society)
• Member and leader of the Polish SQL Server User Group (PLSSUG)
• Passionate about data analysis, storage, and processing; jazz enthusiast
email lukasz@tidk.pl
Data
• 72 hours of video are uploaded per minute on YouTube (1 terabyte every 4 minutes)
• 500 terabytes of new data per day are ingested in Facebook databases
• Sensors from a Boeing jet engine create 20 terabytes of data every hour
• The proposed Square Kilometer Array telescope will generate “a few Exabytes of data per day” (single beam)
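To compare the figures above on one scale, the quoted rates can be converted to terabytes per day. This is only a small arithmetic sketch: the dictionary keys are labels made up for illustration, and the telescope figure is left out because "a few Exabytes" is not a precise number.

```python
# Convert the data rates quoted on the slide to a common unit (terabytes per day).
MINUTES_PER_DAY = 24 * 60
HOURS_PER_DAY = 24

rates_tb_per_day = {
    "YouTube (1 TB every 4 minutes)": MINUTES_PER_DAY / 4,
    "Facebook (500 TB per day)": 500.0,
    "Boeing jet engine (20 TB per hour)": 20 * HOURS_PER_DAY,
}

for source, rate in rates_tb_per_day.items():
    print(f"{source}: {rate:.0f} TB/day")
```

On these numbers, a single jet engine's sensors produce data at roughly the same order of magnitude per day as the Facebook ingestion figure.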
Big Data
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
4V
Volume Variety Velocity Veracity
• Validity
• Value
• Variability
• Venue
• Vocabulary
• Vagueness
Internet Of Things
New Modern BI Solution
[Diagram comparing two architectures]
• Traditional: Original Data → ETL Tool (SSIS, etc): Extract, Transform, Load → Transformed Data → EDW (SQL Server, Teradata, etc) → BI Tools
• Modern: Original Data and Streaming data → Ingest (EL) → Data Lake(s) on Scale-out Storage & Compute (HDFS, Blob Storage, etc) → Transform & Load → Data Marts → BI Tools, Dashboards, Apps
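The difference between the two pipelines in this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular tool's API: the records, the `transform` helper, and the `warehouse`, `data_lake`, and `data_mart` names are all hypothetical.

```python
# Hypothetical raw records; one of them carries a malformed amount.
raw_events = [
    {"user": "a", "amount": "10.5"},
    {"user": "b", "amount": "oops"},
    {"user": "a", "amount": "4.5"},
]

def transform(records):
    """Parse amounts and total them per user, skipping malformed rows."""
    totals = {}
    for rec in records:
        try:
            amount = float(rec["amount"])
        except ValueError:
            continue  # malformed rows are dropped by the transformation
        totals[rec["user"]] = totals.get(rec["user"], 0.0) + amount
    return totals

# ETL: transform first, then load only the transformed result into the warehouse.
warehouse = transform(raw_events)

# ELT: ingest the original data unchanged into the lake, transform later on demand.
data_lake = list(raw_events)
data_mart = transform(data_lake)

print(warehouse, len(data_lake), data_mart)
```

In the ELT variant the malformed record is still available in `data_lake`, so a different transformation can be applied later without going back to the source system.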
Big Data
• Storage
• Processing and Analytics
• Visualization
Visualization
• Reports & Mobile Reports
Storage
Blob
SQL Database & SQL Data Warehouse
DocumentDB
HDInsight
Azure Data Lake Store
Azure Blob Storage
• Blob Storage
• Table Storage
• Queue Storage
• File Storage
SQL Database & SQL Data Warehouse
DocumentDB
Analytics
Azure HDInsight
Azure Data Lake Analytics
Azure Stream Analytics
Azure Machine Learning
Azure Cognitive Services
Azure Data Lake
• Analytics: U-SQL Analytics Service; HDInsight (managed Hadoop clusters); YARN
• Store: WebHDFS
Why Machine Learning
• Analytics: HDInsight (“managed clusters”); Azure Data Lake Analytics
• Storage: Azure Data Lake Storage
HDInsight
• HDInsight is a Hadoop-based service that brings a 100% Apache Hadoop solution to the Microsoft Azure platform
• Based on the Hortonworks Data Platform (HDP)
• Scalable, on-demand service
HDInsight
Why Machine Learning
HDInsight & SQL Server
Query relational and non-relational data, on-premises and in Azure
[Diagram: Apps → T-SQL query → SQL Server and Hadoop]
Azure Stream Analytics
[Diagram of event sources: Point of Service Devices, Self Checkout Stations, Kiosks, Smart Phones, Slates/Tablets, PCs/Laptops, Servers, Digital Signs, Diagnostic Equipment, Remote Medical Monitors, Logic Controllers, Specialized Devices, Thin Clients, Handhelds, Security, POS Terminals, Automation Devices, Vending Machines, Kinect, ATM]
Canonical Event-driven Scenario
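In an event-driven scenario, events arriving from devices are commonly aggregated over time windows. The sketch below shows a tumbling (fixed-size, non-overlapping) window count in plain Python. It is not the Azure Stream Analytics API; the `tumbling_counts` helper and the `(timestamp, device)` event tuples are hypothetical and only illustrate the idea.

```python
from collections import Counter

def tumbling_counts(events, window_seconds):
    """Count events per (window start, device) over fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, device in events:
        # Integer division maps each timestamp to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, device)] += 1
    return dict(counts)

# Hypothetical events: (seconds since start, device type).
events = [(1, "kiosk"), (4, "atm"), (9, "kiosk"), (12, "kiosk"), (19, "atm")]
result = tumbling_counts(events, window_seconds=10)
print(result)
```

Each event falls into exactly one window, so the per-window counts always sum to the total number of events.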
Advanced Analytics
• R and Python languages
• Microsoft R Open, Microsoft R Server, R Services, CRAN R, Revolution
• Mahout
• SparkR
• MLlib
• Azure Machine Learning
• Azure Cognitive Services Models/API
Traditional Data Mining vs Big Data Analysis
Memory access
• Traditional: Data is stored in centralized RAM and can be efficiently scanned several times.
• Big Data analysis: Data may be stored on highly distributed data sources. In the case of huge, continuous data streams, data is accessed only in a single scan.
Computational processing and architectures
• Traditional: Serial, centralized processing. A single-computer platform that scales with better hardware is sufficient.
• Big Data analysis: Parallel and distributed architectures may be necessary. Cluster platforms that scale with several nodes may be necessary.
Data types
• Traditional: The data source is relatively homogeneous. Data is static and of reasonable size.
• Big Data analysis: Data comes from multiple data sources, which may be heterogeneous and complex. Data may be dynamic and evolving. Adapting to data changes may be necessary.
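The "single scan" constraint in the memory access row can be illustrated with a one-pass computation of mean and variance. The sketch below uses Welford's algorithm, chosen here as a standard example and not named in the slides; the `running_mean_variance` function and the sample readings are hypothetical.

```python
def running_mean_variance(stream):
    """Welford's algorithm: mean and population variance in one scan, O(1) memory."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # uses the updated mean
    variance = m2 / count if count else 0.0
    return count, mean, variance

# A generator stands in for a stream that can be read only once.
readings = (float(v) for v in [2, 4, 4, 4, 5, 5, 7, 9])
count, mean, variance = running_mean_variance(readings)
print(count, mean, variance)
```

The traditional setting could simply scan the data twice (once for the mean, once for the deviations); the one-pass form matters when the data cannot be revisited.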
Traditional Data Mining vs Big Data Analysis
Data management
• Traditional: The data format is simple and fits in a relational database or data warehouse. Data access time is not critical.
• Big Data analysis: Data formats are usually diverse and may not fit in a relational database. Data may be greatly interconnected and needs to be integrated from several nodes. Often special data systems are required that manage varied data formats (NoSQL databases, Hadoop,…). Data access time is critical for scalability and speed.
Data quality
• Traditional: The provenance and pre-processing steps are relatively well documented. Strong correction techniques were applied. Data is relatively well tagged and labeled.
• Big Data analysis: The provenance and pre-processing steps may be unclear and undocumented. There is a large amount of uncertainty and imprecision in the data. Only a small amount of the data is tagged and labeled.
Traditional Data Mining vs Big Data Analysis
Data processing
• Traditional: Only batch learning is necessary. Learning can be slow and off-line. Data fits into memory. All the data has some sort of utility.
• Big Data analysis: Data may arrive in a stream and needs to be processed continuously. Learning needs to be fast and online. The scalability of algorithms is important. Data does not fit in memory.
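Online learning, as contrasted with batch learning above, updates a model one example at a time and never holds the full data set in memory. A minimal sketch using stochastic gradient descent on a one-variable linear model follows; the `sgd_update` and `example_stream` helpers, the learning rate, and the synthetic noiseless data are assumptions made for illustration only.

```python
def sgd_update(weights, x, y, learning_rate=0.1):
    """One online gradient step for the model y ≈ w * x + b on a single example."""
    w, b = weights
    error = (w * x + b) - y
    return (w - learning_rate * error * x, b - learning_rate * error)

def example_stream(n):
    """Hypothetical stream: noiseless points on the line y = 2x + 1."""
    for i in range(n):
        x = (i % 10) / 10.0
        yield x, 2.0 * x + 1.0

weights = (0.0, 0.0)
for x, y in example_stream(10000):
    # Each example is used for one update and then discarded.
    weights = sgd_update(weights, x, y)

print(weights)
```

Because the model state is just two numbers, memory use stays constant however long the stream runs; a batch learner would instead need all examples available at once.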
Azure Machine Learning
Cognitive Services
Questions?
lukasz@tidk.pl
