Use of Big Data Architecture
Atif Farid Mohammad, PhD
Data Science Professor, Adjunct
UNC Charlotte
Big Data
• How big is it?
• Does it matter?
• ?
• ?
• ?
• ?
• There can be many more questions…
Word of Caution
• Kindly avoid thinking in SQL mode
• At least for the duration of this talk…
Differences… RDBMS vs. Hadoop
RDBMS
• Schema: required on write (schema-on-write)
• Speed: reads are fast
• Governance: standard and structured
• Processing: limited; no data processing
• Data Types: structured
Hadoop
• Schema: required on read (schema-on-read)
• Speed: writes are fast
• Governance: loosely structured
• Processing: processing coupled with the data
• Data Types: multi-structured and unstructured
RDBMS vs. Hadoop

Aspect      | RDBMS                         | Hadoop
Schema      | Required on write             | Required on read
Speed       | Reads are fast                | Writes are fast
Governance  | Standard and structured       | Loosely structured
Processing  | Limited; no data processing   | Processing coupled with the data
Data Types  | Structured                    | Multi-structured and unstructured
Differences between IT Systems and Hadoop

Attribute   | IT Systems               | Hadoop
Data Size   | Gigabytes                | Petabytes/Zettabytes
Access      | Batch & interactive      | Batch
CRUD        | Read & write many times  | Write once, read many times
Structure   | Static                   | Dynamic
Integrity   | Normalization            | De-normalization
Scalability | Non-linear               | Linear
A Scenario to Understand Big Data
• A trucking company collects… using…???
A Scenario to Understand Big Data…
• GPS
• Speed
• Acceleration
• Stopping
  • Normal
  • Too quick
• Driving too close to other vehicles
What standard technologies will you use?
Hadoop Ecosystem Utilization
• Flume to ingest the raw sensor data
• Sqoop to transport data about the driver and the vehicle into HDFS
• HCatalog to hold all schema definitions
• Hive to analyze gas mileage
• Pig to compute a risk factor for each truck driver based on his/her related events
• Spark to create data sets by applying machine learning (a sketch follows)
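For the Spark step, a minimal PySpark sketch might look like the following. The table name truck_events, its columns (driver_id, event_type), and the risky event types are illustrative assumptions, not details from the talk.

# Hypothetical sketch: per-driver risk factor from truck sensor events.
# Assumes events landed in HDFS and were registered (e.g., via HCatalog/Hive)
# as a table named truck_events with driver_id and event_type columns.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("driver-risk").enableHiveSupport().getOrCreate()

events = spark.read.table("truck_events")

risky_types = ["overspeed", "hard_braking", "tailgating"]  # illustrative event types
risk = (
    events
    .withColumn("is_risky", F.col("event_type").isin(risky_types).cast("int"))
    .groupBy("driver_id")
    .agg(F.sum("is_risky").alias("risky_events"),
         F.count("*").alias("total_events"))
    .withColumn("risk_factor", F.col("risky_events") / F.col("total_events"))
)
risk.show()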
Another Example: A Bank
Data Acquisition
• Input
• Multiple user event feeds (browsing activities, search etc.) per time period
User | Time | Event                                  | Source
U1   | T0   | Visited bank site                      | Server logs
U1   | T1   | Searched for “Credit Cards”            | Search logs
U1   | T2   | Browsed banking services               | Web server logs
U1   | T3   | Saw an e-mail sent link                | Link advertising logs
U1   | T4   | Used OLTP                              | Web server logs
U1   | T5   | Clicked on an ad for “some insurance”  | Ad logs, click server logs
Data Acquisition for the Landing Zone

[Diagram: raw user event feeds flow through Map operations into HDFS as normalized events (NE). The Map operations project the relevant event attributes, filter out irrelevant events, and tag and transform the remainder (categorization, topic, …).]
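A minimal Python sketch of one such Map operation; the field names and filtering/tagging rules are illustrative, not from the talk.

# Hypothetical map-side normalization: filter, project, tag.
def categorize(raw_event):
    # Illustrative categorization rule: tag card-related searches.
    query = raw_event.get("query", "").lower()
    return "Credit Card" if "credit" in query else "Other"

def normalize(raw_event):
    """Return a normalized event, or None to drop the event."""
    if raw_event.get("event") in {"heartbeat", "internal_ping"}:
        return None                      # filter irrelevant events
    return {                             # project relevant attributes...
        "user": raw_event["user"],
        "time": raw_event["time"],
        "event": raw_event["event"],
        "tag": categorize(raw_event),    # ...and tag/transform
    }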
Data Acquisition for the Landing Zone
• Output:
• Single normalized feed containing all events for all users per time period

User | Time | Event                  | Tag
U1   | T0   | Content browsing       | Web clicks by a bank’s user
U2   | T2   | Search query           | Category: Credit Card
…    | …    | …                      | …
U23  | T23  | OLTP usage             | Drop event
U36  | T36  | Bank’s site page click | Category: Some product
Feature and Target Generation for the Discovery Zone
• Features:
• Summaries of user activities over a time window
• Aggregates, Moving averages, Rates etc. over moving time windows
• Support online updates to existing features
• Targets:
• Constructed in the offline model training phase
• Typically user actions in the future time period indicating interest
• Clicks/click-throughs on financial product offerings and content
• Site and page visits
• Conversion events
• Deposit, Withdrawal, Quote requests etc.
• Sign-ups to newsletters, Registrations etc.
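A small sketch of how such window features could be computed, assuming events arrive as (user, time, value) tuples; this shape and the feature names are illustrative, not from the talk.

from collections import defaultdict

def window_features(events, window):
    """Per-user aggregates, rates, and a moving average over the last `window` time units."""
    by_user = defaultdict(list)
    for user, time, value in events:
        by_user[user].append((time, value))
    features = {}
    for user, history in by_user.items():
        history.sort()                                   # order events by time
        t_end = history[-1][0]
        recent = [v for t, v in history if t > t_end - window]
        features[user] = {
            "event_count": len(recent),                  # aggregate within window
            "event_rate": len(recent) / window,          # events per time unit
            "moving_avg": sum(recent) / len(recent),     # simple moving average
        }
    return features

print(window_features([("U1", 1, 2.0), ("U1", 3, 4.0), ("U2", 2, 1.0)], window=5))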
Feature Generation for the Discovery Zone

[Diagram: normalized events (NE 1 … NE 9) stored in HDFS are aggregated by Map tasks (e.g., Map 1–3 emitting (U1, Event 1), (U1, Event 2), …) and routed to Reduce tasks that gather all events for each user. The reducers produce the feature set: summaries over user event history, aggregates within a window, time- and event-weighted averages, event rates, and so on.]
Modeling Workflow within the Discovery Zone

[Diagram: two parallel pipelines. Training phase: data acquisition builds the user event history; feature generation and target generation produce features and targets; model training outputs the weights. Evaluation phase: the same acquisition, feature, and target generation feed model scoring, which produces the evaluation scores.]
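A compact stand-in for the two phases, using scikit-learn with synthetic data in place of the real features and targets; purely illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))          # features from the training phase
y_train = rng.integers(0, 2, size=1000)       # targets (future-interest labels)
X_eval = rng.normal(size=(200, 5))            # features from the evaluation phase

model = LogisticRegression().fit(X_train, y_train)   # model training -> weights
scores = model.predict_proba(X_eval)[:, 1]           # model scoring -> evaluation scores
print(scores[:5])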
Batch Scoring for Discovery Results

[Diagram: data acquisition builds the user event history; feature generation produces features; model scoring combines the features with the trained weights to produce scores, which are delivered to the online serving systems.]
Discovery Zone Pipeline System Estimation

Component                     | Data Processed                 | Time Estimate
Data Acquisition              | ~1 TB per time period          | 2–3 hours
Feature and Target Generation | ~1 TB × size of feature window | 4–6 hours
Model Training                | ~50–100 GB                     | 1–2 hours for 100s of models
Scoring                       | ~500 GB                        | 1 hour
Requirements Extraction Process
• A two-step process is used for requirements extraction:
1) Extract specific requirements and map them to the reference architecture based on each application’s characteristics, such as:
a) data sources (data size, file formats, rate of growth, at rest or in motion, etc.)
b) data lifecycle management (curation, conversion, quality checks, pre-analytic processing, etc.)
c) data transformation (data fusion/mashup, analytics)
d) capability infrastructure (software tools, platform tools, hardware resources such as storage and networking)
e) data usage (processed results in text, table, visual, and other formats)
f) all architecture components are informed by the goals and the use-case description
g) Security & Privacy maps directly onto the reference architecture
2) Aggregate all specific requirements into high-level generalized requirements that are vendor-neutral and technology-agnostic.
Big Data Edge from Data Warehouse

Cloud Business Intelligence involves extensive processes and costs:
 Data Analysis
 Data Cleansing
 Entity-Relationship Modeling
 Dimensional Modeling
 Database Design & Implementation
 Database Population through ETL/ELT
 Downstream Applications Linkage – Metadata
 Maintaining the Processes

[Diagram: source data flows through these processes into data marts and a set of analytical databases.]
Metadata Management - HCatalog

[Diagram: the BI reference architecture with Hadoop components overlaid. In the data repositories layer, HDFS acts as a single-source Data Lake feeding analytical data marts; Sqoop handles extraction, MapReduce/PIG handles load/apply, and HCatalog provides metadata management. HCatalog & Pig can work with most ETL tools on the market.]
Reference Architecture

[Diagram: a classic BI reference architecture. Data sources (enterprise, unstructured, informational, and external: supplier, orders, product, promotions, customer, location, invoice, ePOS, other) feed the data integration layer (extraction, transformation, load/apply, synchronization, transport/messaging, information integrity) into the data repositories (operational data stores, data warehouse, data marts, staging areas, metadata). Analytics (query & reporting, data mining, modeling, scorecards, visualization, embedded analytics) are delivered through the access layer (web browsers, portals, devices such as mobile, web services, collaboration, business applications). Cross-cutting layers: metadata management; security and data privacy; system management and administration; network connectivity, protocols & access middleware; hardware & software platforms.]
Reference Architecture

[Diagram: the same reference architecture, highlighting the data integration layer; each component below is mapped from the current BI stack to its proposed Hadoop counterpart.]

Transport / Messaging
HCatalog – A Hadoop metadata repository and management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.

Extraction
Extraction is the process of transferring data, usually from relational databases into flat files, which can then be transported to the landing area of a Data Warehouse and ingested into the BI/DW environment.
Sqoop – A command-line application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs that can be run multiple times to import updates made to a database since the last import. Exports can be used to push data from Hadoop into a relational database.
[Diagram: in the current BI, a database extract is moved by sftp from source to target; in the proposed BI, Sqoop transfers data directly from source to target.]
MapReduce – A framework for writing applications that process large amounts of structured and unstructured data in parallel across large clusters of machines in a reliable and fault-tolerant manner.
Pig – A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs, paired with the MapReduce framework for executing these programs.
Transformation
[Diagram: in the current BI, complex ETL moves data from Landing through Staging to the DW and on to the data marts; in the proposed BI, MapReduce/Pig transforms data straight from HDFS into the data marts.]
Load / Apply
[Diagram: in the current BI, data is loaded from Staging into the DW and then applied to the data marts; in the proposed BI, the DW load is eliminated and the data marts are fed directly from HDFS.]
Synchronization
Synchronization – The ETL process takes source data from staging, transforms it using business rules, and loads it into the central DW repository. In this scenario, to retain information integrity, one has to put a synchronization check-and-correction mechanism in place.
HDFS as a Single Source – In the proposed solution, HDFS acts as a single source of data, so there is no danger of desynchronization. Inconsistencies resulting from duplicated or inconsistent data are reconciled with the assistance of HCatalog and proper data governance.
[Diagram: in the current BI, Landing, Staging, the DW, and the data marts must all be kept synchronized with the source; in the proposed BI, HDFS is the single source feeding the data marts.]
Information Integrity
Current – There is currently no special approach to data quality beyond what is embedded in the ETL processes and logic. There are tools and approaches available to implement QA & QC.
Hadoop – A more focused approach: while HDFS is used as one big “Data Lake”, QA and QC are applied at the data mart level, where the actual transformations occur, reducing the overall effort. QA & QC become an integral part of data governance, augmented by the use of HCatalog.
Reference Architecture

[Diagram: in the data repositories layer of the reference architecture, the operational data stores, data warehouse, data marts, staging areas, and metadata are replaced by HDFS, with HCatalog layered on top for metadata.]

HCatalog Metadata Management
HCatalog – A Hadoop metadata repository and management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
Hadoop Distributed File System (HDFS) – A reliable, distributed, Java-based file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers.
Reference Architecture

[Diagram: the proposed architecture end to end. Data sources feed HDFS through Sqoop over transport/messaging; MapReduce/PIG performs load/apply from HDFS as a single source into the analytical data marts; HCatalog provides metadata management and, together with Pig, can work with Informatica for data integration.]
Reference Architecture

Capability                                 | Current BI                            | Proposed BI                                                               | Expected Change
Data Sources                               | Source applications                   | Source applications                                                       | No
Data Integration:
Extraction from source DB                  | Export                                | Sqoop                                                                     | One-to-one change
Transport/Messaging                        | SFTP                                  | SFTP                                                                      | No
Staging Area:
Transformations/Load                       | Complex ETL code                      | None required                                                             | Eliminated
Extract from staging                       | Complex ETL code                      | None required                                                             | Eliminated
Transformation for DW                      | Complex ETL code                      | None required                                                             | Eliminated
Load to DW                                 | Complex ETL, RDBMS                    | None required                                                             | Eliminated
Extract from DW, transform and load to DM  | Complex ETL code & process to feed DM | MapReduce/Pig                                                             | Simplified transformations from HDFS to DM
Data Quality, Balance & Controls           | Embedded ETL code                     | MapReduce/Pig in conjunction with HCatalog; can also coexist with Informatica | Yes
MapReduce

Map Operation
MAP: input data → <key, value> pairs

[Diagram: the input data collection is split (split 1, split 2, …, split n) to supply multiple processors; each split is consumed by a Map task that emits <key, value> pairs such as (web, 1), (weed, 1), (green, 1), (sun, 1), (moon, 1), (land, 1), (part, 1), (web, 1), (green, 1), ….]
Reduce Operation
MAP: input data → <key, value> pairs
REDUCE: <key, value> pairs → <result>

[Diagram: the same splits and Map tasks as above, now followed by Reduce tasks that combine the emitted <key, value> pairs into results.]
MapReduce

[Diagram: a terabyte-sized collection of words (Cat, Bat, Dog, other words) flows through split → map → combine → reduce stages, producing output partitions part0, part1, and part2.]
[Diagram: large-scale data splits feed Map tasks that emit <key, 1> pairs; a parse/hash step routes each key to a Reducer (say, Count), which writes partitions P-0000, P-0001, and P-0002 containing count1, count2, and count3.]
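To make the flow in these diagrams concrete, here is a minimal single-process Python sketch of word count; the in-memory dictionary stands in for the shuffle/combine stages.

from collections import defaultdict

def map_phase(lines):
    """MAP: input data -> <key, value> pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle groups the pairs by key; REDUCE sums each group into <key, count>."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

print(reduce_phase(map_phase(["web weed green sun", "moon land part web green"])))
# {'web': 2, 'weed': 1, 'green': 2, 'sun': 1, 'moon': 1, 'land': 1, 'part': 1}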
Web References
• “MapReduce: Simplified Data Processing on Large Clusters”, Jeffrey Dean and Sanjay Ghemawat, December 2004. http://labs.google.com/papers/mapreduce.html
• “Scalable SQL”, Michael Rys, ACM Queue, April 19, 2011. http://queue.acm.org/detail.cfm?id=1971597
• “a practical guide to noSQL”, Denise Miura, March 17, 2011. http://blogs.marklogic.com/2011/03/17/a-practical-guide-to-nosql/
Thank you
Questions…