Use of Big Data Architecture
Atif Farid Mohammad, PhD
Data Science Professor, Adjunct
UNC Charlotte
Big Data
• How big is it?
• Does it matter?
• ?
• ?
• ?
• ?
• There can be many more questions…
Word of Caution
• Kindly avoid thinking in SQL mode
• At least for the duration of this talk…
Differences… RDBMS vs. Hadoop
RDBMS
• Schema: required on write (schema-on-write)
• Speed: reads are fast
• Governance: standard and structured
• Processing: limited; no data processing
• Data Types: structured
Hadoop
• Schema: required on read (schema-on-read)
• Speed: writes are fast
• Governance: loosely structured
• Processing: processing coupled with the data
• Data Types: multi-structured and unstructured
RDBMS vs. Hadoop

Aspect      | RDBMS                         | Hadoop
Schema      | Required on write             | Required on read
Speed       | Reads are fast                | Writes are fast
Governance  | Standard and structured       | Loosely structured
Processing  | Limited; no data processing   | Processing coupled with the data
Data Types  | Structured                    | Multi-structured and unstructured
Differences between IT Systems and Hadoop

Attribute   | IT Systems               | Hadoop
Data Size   | Gigabytes                | Petabytes/Zettabytes
Access      | Batch & interactive      | Batch
CRUD        | Read & write many times  | Write once, read many times
Structure   | Static                   | Dynamic
Integrity   | Normalization            | De-normalization
Scalability | Non-linear               | Linear
A Scenario to Understand Big Data
• A trucking company collects… using…???
A Scenario to Understand Big Data…
• GPS
• Speed
• Acceleration
• Stopping
  • Normal
  • Too quick
• Driving too close to other vehicles
What standard technologies will you use?
Hadoop Ecosystem Utilization
• Flume to ingest the raw sensor data
• Sqoop to transport data about the driver and the vehicle into HDFS
• HCatalog to hold all schema definitions
• Hive to analyze gas mileage
• Pig to compute a risk factor for each truck driver based on his/her related events
• Spark to create data sets by applying machine learning (a sketch follows)
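For the Spark step, a minimal PySpark sketch might look like the following. The table name truck_events, its columns (driver_id, event_type), and the risky event types are illustrative assumptions, not details from the talk.

# Hypothetical sketch: per-driver risk factor from truck sensor events.
# Assumes events landed in HDFS and were registered (e.g., via HCatalog/Hive)
# as a table named truck_events with driver_id and event_type columns.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("driver-risk").enableHiveSupport().getOrCreate()

events = spark.read.table("truck_events")

risky_types = ["overspeed", "hard_braking", "tailgating"]  # illustrative event types
risk = (
    events
    .withColumn("is_risky", F.col("event_type").isin(risky_types).cast("int"))
    .groupBy("driver_id")
    .agg(F.sum("is_risky").alias("risky_events"),
         F.count("*").alias("total_events"))
    .withColumn("risk_factor", F.col("risky_events") / F.col("total_events"))
)
risk.show()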
Another Example: A Bank
Data Acquisition
• Input
• Multiple user event feeds (browsing activities, search etc.) per time period
User | Time | Event                                  | Source
U1   | T0   | Visited bank site                      | Server logs
U1   | T1   | Searched for “Credit Cards”            | Search logs
U1   | T2   | Browsed banking services               | Web server logs
U1   | T3   | Saw an e-mail sent link                | Link advertising logs
U1   | T4   | Used OLTP                              | Web server logs
U1   | T5   | Clicked on an ad for “some insurance”  | Ad logs, click server logs
Data Acquisition for the Landing Zone

[Diagram: raw user event feeds flow through Map operations into HDFS as normalized events (NE). The Map operations project the relevant event attributes, filter out irrelevant events, and tag and transform the remainder (categorization, topic, …).]
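A minimal Python sketch of one such Map operation; the field names and filtering/tagging rules are illustrative, not from the talk.

# Hypothetical map-side normalization: filter, project, tag.
def categorize(raw_event):
    # Illustrative categorization rule: tag card-related searches.
    query = raw_event.get("query", "").lower()
    return "Credit Card" if "credit" in query else "Other"

def normalize(raw_event):
    """Return a normalized event, or None to drop the event."""
    if raw_event.get("event") in {"heartbeat", "internal_ping"}:
        return None                      # filter irrelevant events
    return {                             # project relevant attributes...
        "user": raw_event["user"],
        "time": raw_event["time"],
        "event": raw_event["event"],
        "tag": categorize(raw_event),    # ...and tag/transform
    }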
Data Acquisition for the Landing Zone
• Output:
• Single normalized feed containing all events for all users per time period

User | Time | Event                  | Tag
U1   | T0   | Content browsing       | Web clicks by a bank’s user
U2   | T2   | Search query           | Category: Credit Card
…    | …    | …                      | …
U23  | T23  | OLTP usage             | Drop event
U36  | T36  | Bank’s site page click | Category: Some product
Feature and Target Generation for the Discovery Zone
• Features:
• Summaries of user activities over a time window
• Aggregates, Moving averages, Rates etc. over moving time windows
• Support online updates to existing features
• Targets:
• Constructed in the offline model training phase
• Typically user actions in the future time period indicating interest
• Clicks/click-throughs on financial product offerings and content
• Site and page visits
• Conversion events
• Deposit, Withdrawal, Quote requests etc.
• Sign-ups to newsletters, Registrations etc.
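A small sketch of how such window features could be computed, assuming events arrive as (user, time, value) tuples; this shape and the feature names are illustrative, not from the talk.

from collections import defaultdict

def window_features(events, window):
    """Per-user aggregates, rates, and a moving average over the last `window` time units."""
    by_user = defaultdict(list)
    for user, time, value in events:
        by_user[user].append((time, value))
    features = {}
    for user, history in by_user.items():
        history.sort()                                   # order events by time
        t_end = history[-1][0]
        recent = [v for t, v in history if t > t_end - window]
        features[user] = {
            "event_count": len(recent),                  # aggregate within window
            "event_rate": len(recent) / window,          # events per time unit
            "moving_avg": sum(recent) / len(recent),     # simple moving average
        }
    return features

print(window_features([("U1", 1, 2.0), ("U1", 3, 4.0), ("U2", 2, 1.0)], window=5))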
Feature Generation for the Discovery Zone

[Diagram: normalized events (NE 1 … NE 9) stored in HDFS are aggregated by Map tasks (e.g., Map 1–3 emitting (U1, Event 1), (U1, Event 2), …) and routed to Reduce tasks that gather all events for each user. The reducers produce the feature set: summaries over user event history, aggregates within a window, time- and event-weighted averages, event rates, and so on.]
Modeling Workflow within the Discovery Zone

[Diagram: two parallel pipelines. Training phase: data acquisition builds the user event history; feature generation and target generation produce features and targets; model training outputs the weights. Evaluation phase: the same acquisition, feature, and target generation feed model scoring, which produces the evaluation scores.]
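A compact stand-in for the two phases, using scikit-learn with synthetic data in place of the real features and targets; purely illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))          # features from the training phase
y_train = rng.integers(0, 2, size=1000)       # targets (future-interest labels)
X_eval = rng.normal(size=(200, 5))            # features from the evaluation phase

model = LogisticRegression().fit(X_train, y_train)   # model training -> weights
scores = model.predict_proba(X_eval)[:, 1]           # model scoring -> evaluation scores
print(scores[:5])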
Batch Scoring for Discovery Results

[Diagram: data acquisition builds the user event history; feature generation produces features; model scoring combines the features with the trained weights to produce scores, which are delivered to the online serving systems.]
Discovery Zone Pipeline System Estimation

Component                     | Data Processed                 | Time Estimate
Data Acquisition              | ~1 TB per time period          | 2–3 hours
Feature and Target Generation | ~1 TB × size of feature window | 4–6 hours
Model Training                | ~50–100 GB                     | 1–2 hours for 100s of models
Scoring                       | ~500 GB                        | 1 hour
Requirements Extraction Process
• A two-step process is used for requirements extraction:
1) Extract specific requirements and map them to the reference architecture based on each application’s characteristics, such as:
a) data sources (data size, file formats, rate of growth, at rest or in motion, etc.)
b) data lifecycle management (curation, conversion, quality checks, pre-analytic processing, etc.)
c) data transformation (data fusion/mashup, analytics)
d) capability infrastructure (software tools, platform tools, hardware resources such as storage and networking)
e) data usage (processed results in text, table, visual, and other formats)
f) all architecture components are informed by the goals and the use-case description
g) Security & Privacy maps directly onto the reference architecture
2) Aggregate all specific requirements into high-level generalized requirements that are vendor-neutral and technology-agnostic.
Big Data Edge from Data Warehouse

Cloud Business Intelligence involves extensive processes and costs:
 Data Analysis
 Data Cleansing
 Entity-Relationship Modeling
 Dimensional Modeling
 Database Design & Implementation
 Database Population through ETL/ELT
 Downstream Applications Linkage – Metadata
 Maintaining the Processes

[Diagram: source data flows through these processes into data marts and a set of analytical databases.]
Metadata Management - HCatalog

[Diagram: the BI reference architecture with Hadoop components overlaid. In the data repositories layer, HDFS acts as a single-source Data Lake feeding analytical data marts; Sqoop handles extraction, MapReduce/PIG handles load/apply, and HCatalog provides metadata management. HCatalog & Pig can work with most ETL tools on the market.]
Reference Architecture

[Diagram: a classic BI reference architecture. Data sources (enterprise, unstructured, informational, and external: supplier, orders, product, promotions, customer, location, invoice, ePOS, other) feed the data integration layer (extraction, transformation, load/apply, synchronization, transport/messaging, information integrity) into the data repositories (operational data stores, data warehouse, data marts, staging areas, metadata). Analytics (query & reporting, data mining, modeling, scorecards, visualization, embedded analytics) are delivered through the access layer (web browsers, portals, devices such as mobile, web services, collaboration, business applications). Cross-cutting layers: metadata management; security and data privacy; system management and administration; network connectivity, protocols & access middleware; hardware & software platforms.]
Reference Architecture

[Diagram: the same reference architecture, highlighting the data integration layer; each component below is mapped from the current BI stack to its proposed Hadoop counterpart.]

Transport / Messaging
HCatalog – A Hadoop metadata repository and management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.

Extraction
Extraction is the process of transferring data, usually from relational databases into flat files, which can then be transported to the landing area of a Data Warehouse and ingested into the BI/DW environment.
Sqoop – A command-line application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs that can be run multiple times to import updates made to a database since the last import. Exports can be used to push data from Hadoop into a relational database.
[Diagram: in the current BI, a database extract is moved by sftp from source to target; in the proposed BI, Sqoop transfers data directly from source to target.]
MapReduce – A framework for writing applications that process large amounts of structured and unstructured data in parallel across large clusters of machines in a reliable and fault-tolerant manner.
Pig – A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs, paired with the MapReduce framework for executing these programs.
Transformation
[Diagram: in the current BI, complex ETL moves data from Landing through Staging to the DW and on to the data marts; in the proposed BI, MapReduce/Pig transforms data straight from HDFS into the data marts.]
Load / Apply
[Diagram: in the current BI, data is loaded from Staging into the DW and then applied to the data marts; in the proposed BI, the DW load is eliminated and the data marts are fed directly from HDFS.]
Synchronization
Synchronization – The ETL process takes source data from staging, transforms it using business rules, and loads it into the central DW repository. In this scenario, to retain information integrity, one has to put a synchronization check-and-correction mechanism in place.
HDFS as a Single Source – In the proposed solution, HDFS acts as a single source of data, so there is no danger of desynchronization. Inconsistencies resulting from duplicated or inconsistent data are reconciled with the assistance of HCatalog and proper data governance.
[Diagram: in the current BI, Landing, Staging, the DW, and the data marts must all be kept synchronized with the source; in the proposed BI, HDFS is the single source feeding the data marts.]
Information Integrity
Current – There is currently no special approach to data quality beyond what is embedded in the ETL processes and logic. There are tools and approaches available to implement QA & QC.
Hadoop – A more focused approach: while HDFS is used as one big “Data Lake”, QA and QC are applied at the data mart level, where the actual transformations occur, reducing the overall effort. QA & QC become an integral part of data governance, augmented by the use of HCatalog.
Reference Architecture

[Diagram: in the data repositories layer of the reference architecture, the operational data stores, data warehouse, data marts, staging areas, and metadata are replaced by HDFS, with HCatalog layered on top for metadata.]

HCatalog Metadata Management
HCatalog – A Hadoop metadata repository and management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
Hadoop Distributed File System (HDFS) – A reliable, distributed, Java-based file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers.
Reference Architecture

[Diagram: the proposed architecture end to end. Data sources feed HDFS through Sqoop over transport/messaging; MapReduce/PIG performs load/apply from HDFS as a single source into the analytical data marts; HCatalog provides metadata management and, together with Pig, can work with Informatica for data integration.]
Reference Architecture

Capability                                 | Current BI                            | Proposed BI                                                               | Expected Change
Data Sources                               | Source applications                   | Source applications                                                       | No
Data Integration:
Extraction from source DB                  | Export                                | Sqoop                                                                     | One-to-one change
Transport/Messaging                        | SFTP                                  | SFTP                                                                      | No
Staging Area:
Transformations/Load                       | Complex ETL code                      | None required                                                             | Eliminated
Extract from staging                       | Complex ETL code                      | None required                                                             | Eliminated
Transformation for DW                      | Complex ETL code                      | None required                                                             | Eliminated
Load to DW                                 | Complex ETL, RDBMS                    | None required                                                             | Eliminated
Extract from DW, transform and load to DM  | Complex ETL code & process to feed DM | MapReduce/Pig                                                             | Simplified transformations from HDFS to DM
Data Quality, Balance & Controls           | Embedded ETL code                     | MapReduce/Pig in conjunction with HCatalog; can also coexist with Informatica | Yes
MapReduce

Map Operation
MAP: input data → <key, value> pairs

[Diagram: the input data collection is split (split 1, split 2, …, split n) to supply multiple processors; each split is consumed by a Map task that emits <key, value> pairs such as (web, 1), (weed, 1), (green, 1), (sun, 1), (moon, 1), (land, 1), (part, 1), (web, 1), (green, 1), ….]
Reduce Operation
MAP: input data → <key, value> pairs
REDUCE: <key, value> pairs → <result>

[Diagram: the same splits and Map tasks as above, now followed by Reduce tasks that combine the emitted <key, value> pairs into results.]
MapReduce

[Diagram: a terabyte-sized collection of words (Cat, Bat, Dog, other words) flows through split → map → combine → reduce stages, producing output partitions part0, part1, and part2.]
[Diagram: large-scale data splits feed Map tasks that emit <key, 1> pairs; a parse/hash step routes each key to a Reducer (say, Count), which writes partitions P-0000, P-0001, and P-0002 containing count1, count2, and count3.]
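To make the flow in these diagrams concrete, here is a minimal single-process Python sketch of word count; the in-memory dictionary stands in for the shuffle/combine stages.

from collections import defaultdict

def map_phase(lines):
    """MAP: input data -> <key, value> pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle groups the pairs by key; REDUCE sums each group into <key, count>."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

print(reduce_phase(map_phase(["web weed green sun", "moon land part web green"])))
# {'web': 2, 'weed': 1, 'green': 2, 'sun': 1, 'moon': 1, 'land': 1, 'part': 1}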
Web References
• “MapReduce: Simplified Data Processing on Large Clusters”, Jeffrey Dean and Sanjay Ghemawat, December 2004. http://labs.google.com/papers/mapreduce.html
• “Scalable SQL”, Michael Rys, ACM Queue, April 19, 2011. http://queue.acm.org/detail.cfm?id=1971597
• “a practical guide to noSQL”, Denise Miura, March 17, 2011. http://blogs.marklogic.com/2011/03/17/a-practical-guide-to-nosql/
Thank you
Questions…