Big Data for Managers: From Hadoop to Streaming and Beyond

Dr. Vladimir Bacvanski
vladimir.bacvanski@scispike.com
@OnSoftware

www.scispike.com Copyright © SciSpike 2016
Dr. Vladimir Bacvanski
§  Founder of SciSpike, a development,
consulting, and training firm
§  Passionate about software and data
§  PhD in computer science RWTH Aachen,
Germany
§  Architect, consultant, mentor
§  Custom development: Scalable Web
and IoT systems
§  Training and mentoring in
Big Data, Scala, node.js, software
architecture
@OnSoftware
https://www.linkedin.com/in/vladimirbacvanski

Problems with Relational Stores
§  Data that does not naturally fit into tables
→ Impedance mismatch
§  Development time often too long
§  Dealing with unstructured data
§  Performance problems
§  Difficult to run on clusters
§  Cost

Structured and Unstructured Data Sources
Structured Data Sources
• Existing databases
• ERP/CRM/BI systems
• Inventory
• Supply chain
Unstructured Data Sources
• Server logs
• Search engine logs
• Browsing logs
• E-Commerce records
• Social media
• Voice
• Video
• Sensor data

NoSQL Impact
[Figure: cost/performance against data volume (1M, 1B, 1T, 1Q, … HUGE, each step x1000), with disks and processors as the scaling resources]
§  Relational database
– Today: doable, but expensive and slow
– Tomorrow: volume is out of reach
§  Big Data + NoSQL
– Stabilize cost and increase performance
– Enable unlimited volume growth

Scale Up vs. Scale Out
[Figure: two charts with the axes Capability and Cost, one for Scale Up and one for Scale Out]

A Common Pattern for Processing Large Data
1.  Load a large set of records onto a set of machines
2.  Extract something interesting from each record ("Map")
3.  Shuffle and sort intermediate results (key/value pairs)
4.  Aggregate intermediate results ("Reduce")
5.  Store end result

Two Key Aspects of Hadoop
§  MapReduce framework
– How Hadoop understands and assigns work to the nodes
(machines)
§  Hadoop Distributed File System = HDFS
– Where Hadoop stores data
– A file system that spans all the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to
make them into one big file system

MapReduce Example: Word Count
§  WordCount is the "Hello World" of Big Data
– You will see various technologies implementing it
– A good first step to compare the expressiveness of Big Data
tools
Input (pets.txt):
dog cat bird
dog cat bird
dog dog cat
Map:
(dog, 1) (cat, 1) (bird, 1)
(dog, 1) (cat, 1) (bird, 1)
(dog, 1) (dog, 1) (cat, 1)
Shuffle:
(dog, 1) (dog, 1) (dog, 1) (dog, 1)
(cat, 1) (cat, 1) (cat, 1)
(bird, 1) (bird, 1)
Reduce, output (pet_freq.txt):
dog, 4
cat, 3
bird, 2
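
The word count flow can be sketched in plain Python. This is a single-process illustration of the map, shuffle, and reduce stages, not Hadoop code; the function names are hypothetical and chosen only to mirror the stage names on the slide.

```python
from collections import defaultdict

def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    """Group the emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped values to get one count per word."""
    return {key: sum(values) for key, values in groups.items()}

# Contents of pets.txt from the slide.
pets = ["dog cat bird", "dog cat bird", "dog dog cat"]
pet_freq = reduce_phase(shuffle_phase(map_phase(pets)))
print(pet_freq)  # {'dog': 4, 'cat': 3, 'bird': 2}
```

In a real cluster the programmer writes only the map and reduce logic; the grouping step is done by the framework.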

The MapReduce Programming Model
§  "Map" step:
–  Input split into pieces
–  Worker nodes process individual pieces in parallel (under
global control of the Job Tracker node)
–  Each worker node stores its result in its local file system
where a reducer is able to access it

§  "Reduce" step:
–  Data is aggregated ("reduced" from the map steps) by
worker nodes (under control of the Job Tracker)
–  Multiple reduce tasks can parallelize the aggregation
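
A minimal sketch of the two steps, assuming worker threads stand in for worker nodes and a simple loop stands in for the Job Tracker. The names `NUM_REDUCERS`, `map_task`, `partition`, `reduce_task`, and `run_job` are hypothetical; real Hadoop distributes this work across machines and handles failures.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_REDUCERS = 2  # hypothetical number of reduce tasks

def map_task(split):
    """Map one input split into (word, 1) pairs."""
    return [(word, 1) for line in split for word in line.split()]

def partition(key):
    """Pick the reduce task for a key using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS

def reduce_task(pairs):
    """Aggregate all pairs routed to one reduce task."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def run_job(splits):
    # Map step: process the input splits in parallel worker threads.
    with ThreadPoolExecutor() as pool:
        map_outputs = list(pool.map(map_task, splits))
    # Route every intermediate pair to its reduce task.
    buckets = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            buckets[partition(key)].append((key, value))
    # Reduce step: the reduce tasks also run in parallel.
    with ThreadPoolExecutor() as pool:
        partial_results = list(pool.map(reduce_task, buckets.values()))
    # Each key lives in exactly one bucket, so merging is a plain union.
    result = {}
    for partial in partial_results:
        result.update(partial)
    return result

splits = [["dog cat bird"], ["dog cat bird"], ["dog dog cat"]]
counts = run_job(splits)
print(counts)
```

Routing each key to exactly one reduce task is what lets several reducers aggregate in parallel without coordinating with each other.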

Separation of Work
Programmers
• Map
• Reduce
Framework
• Deals with fault tolerance
• Assigns workers to map and reduce tasks
• Moves processes to data
• Shuffles and sorts intermediate data
• Deals with errors

How To Create MapReduce Jobs
§  Java API
– Low level, very flexible
– Time consuming development
§  Streaming API
– A simple, productive model for Python and Ruby
§  Hive
– Open source language / Apache sub-project
– Provides a SQL-like interface to Hadoop
§  Pig
– Data flow language / Apache sub-project
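
To illustrate the Streaming API option, here is a hedged sketch of a word count mapper and reducer in Python. Hadoop Streaming passes records to scripts as text lines and expects tab-separated key/value lines back, with reducer input sorted by key. The functions below take any iterable of lines so they can be tried locally; the `sorted` call is an assumption that stands in for the framework's shuffle.

```python
from itertools import groupby

def mapper(lines):
    """Turn each input line into tab-separated 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum counts per word; the input must already be sorted by key."""
    records = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(records, key=lambda rec: rec[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local dry run: sorting stands in for Hadoop's shuffle and sort.
lines = ["dog cat bird", "dog cat bird", "dog dog cat"]
mapped = sorted(mapper(lines))
reduced = list(reducer(mapped))
print(reduced)  # ['bird\t2', 'cat\t3', 'dog\t4']
```

In an actual job, each function would live in its own script that reads `sys.stdin` and prints to standard output, and the scripts would be handed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options.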


The Big Picture: NoSQL + Hadoop in Applications
[Figure: applications backed by several specialized stores]
§  Columnar: price updates, logs
§  Document: product info
§  Graph: customer/agent relationships
§  RDB: XA data
§  Hadoop: oper. analytics, price analytics
§  Key/Value: session data

Streaming: A New Paradigm
§  Conventional processing: static data
[Diagram: queries are run against stored data to produce results]
§  Real-time processing: streaming data
[Diagram: data flows through standing queries to produce results]
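
The streaming model can be sketched with a Python generator: the query is set up once and produces a new result for each arriving item. The names `sliding_average` and `sensor_readings` and the sample values are hypothetical; a real stream would be unbounded.

```python
from collections import deque

def sliding_average(stream, window_size):
    """Standing query: yield the average of the last `window_size` readings."""
    window = deque(maxlen=window_size)
    for reading in stream:
        window.append(reading)
        yield sum(window) / len(window)

def sensor_readings():
    """Hypothetical data source; a real stream would never end."""
    yield from [10, 20, 30, 40]

averages = list(sliding_average(sensor_readings(), window_size=2))
print(averages)  # [10.0, 15.0, 25.0, 35.0]
```

Only the current window is kept in memory, which is why this style works on data that is too large or too fast to store first and query later.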

Common Streaming Applications
§  Personalization
§  Search
§  Revenue optimization
§  User events
§  Content feeds
§  Log processing
§  Monitoring
§  Recommendations
§  Ads
§  Notable users:
–  Twitter
–  Yahoo
–  Spotify
–  Cisco
–  Flickr
–  Weather Channel

Beyond Hadoop: Spark & Flink
[Figure: processing engines: MapReduce, Tez, Spark, Flink]

Apache Spark
§  Important Features
– In-memory data
– Resilient Distributed Datasets (RDDs)
•  Datasets can rebuild themselves if failure occurs
– Rich set of operators
§  Efficient:
– Up to 10x (on disk) to 100x (in memory) faster than Hadoop MR
– 2 to 5 times less code (rich APIs in Scala/Java/Python)
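
The idea that a dataset can rebuild itself can be illustrated with a toy class that remembers its lineage (the source plus the ordered transformations). This is not Spark's API; `ToyDataset` and `lose_cache` are hypothetical names, and the sketch runs in a single process.

```python
class ToyDataset:
    """A toy stand-in for an RDD: it remembers how it was derived."""

    def __init__(self, source, transforms=()):
        self._source = source          # function that re-reads the input
        self._transforms = transforms  # lineage: ordered transformations
        self._cache = None             # in-memory copy of the result

    def map(self, fn):
        step = lambda rows: [fn(r) for r in rows]
        return ToyDataset(self._source, self._transforms + (step,))

    def filter(self, pred):
        step = lambda rows: [r for r in rows if pred(r)]
        return ToyDataset(self._source, self._transforms + (step,))

    def collect(self):
        if self._cache is None:        # compute only when data is missing
            rows = self._source()
            for transform in self._transforms:
                rows = transform(rows)
            self._cache = rows
        return list(self._cache)

    def lose_cache(self):
        """Simulate a node failure that wipes the in-memory data."""
        self._cache = None

numbers = ToyDataset(lambda: [1, 2, 3, 4, 5])
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
first = even_squares.collect()
even_squares.lose_cache()
rebuilt = even_squares.collect()  # recomputed from the lineage
print(first, rebuilt)  # [4, 16] [4, 16]
```

Because the recipe is stored rather than only the result, lost data is recomputed instead of restored from a replica.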

Spark Architecture
§  A powerful set of tools
§  Beyond traditional Hadoop
Source: http://spark.apache.org

Data Sharing in Apache Spark
[Figure: HDFS, Iteration 1, Result 1 (held in cluster memory), Iteration 2, Result 2 (held in cluster memory), Query 1, Query 2]

Apache Flink
§  Execution:
–  Programs compiled into an execution plan
–  Plan is optimized
–  Executed
§  Design goals:
–  High performance
–  Hybrid batch and streaming runtime
–  Simplicity for the developer
–  Rich libraries
–  Integration with many systems

Apache Flink Components
§  Integration with Hadoop YARN, MapReduce, HBase,
Cassandra, Kafka, …
§  Execution engine for Apache Beam (Google Dataflow)

Flink Optimization and Execution
§  Optimizer selects an execution plan
§  Similar to what we have in relational databases
§  Optimal plan depends on the size of the input files
§  Run as standalone or on top of Hadoop
§  Integration with many Hadoop technologies
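
To show why the best plan depends on input size, here is a toy cost-based choice between two join strategies. The cost model, the `workers` parameter, and the function name are all assumptions made for illustration; this is not Flink's optimizer or its API.

```python
def choose_join_plan(left_rows, right_rows, workers=4):
    """Pick the cheaper of two join strategies using a made-up cost model."""
    small = min(left_rows, right_rows)
    plans = {
        # Copy the smaller input to every worker; the larger one stays put.
        "broadcast": small * workers,
        # Re-partition both inputs by join key across the network.
        "repartition": left_rows + right_rows,
    }
    best = min(plans, key=plans.get)
    return best, plans[best]

print(choose_join_plan(1_000_000, 100))      # ('broadcast', 400)
print(choose_join_plan(1_000_000, 900_000))  # ('repartition', 1900000)
```

The same program gets a different plan when one input is tiny, which is the kind of decision a relational query optimizer also makes.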

Flink & Spark: The Advantages and Outlook
§  Less IO overhead than conventional Hadoop
§  Caching
§  Iterative algorithms
§  Unifying batch and stream computing
§  Scala as a natural, expressive language for Big Data
– Other languages: Python, Java, R
§  Beware of less mature components

Typical NoSQL Systems
§  Non-relational
§  Distributed
§  Horizontally scalable
§  No need for a fixed schema
§  Several established players
§  Systems are specialized

NoSQL Stores and Their Categories
§  Choose the store that best matches your application
§  It is fine to use several different stores
– "Polyglot persistence"
[Figure: four categories: Key-Value, Column-Family, Document-Oriented, Graph DB]

NoSQL Stores: Scale vs. Complexity of Data
[Figure: Key-Value, Column-Family, Document-Oriented, and Graph DB stores plotted by scalability and complexity; a marked region shows the needs of most applications]

Key-Value Stores
§  Key → Value mapping
§  Large, persistent Map ("hashtable")
– Values could be lists and hashes
§  Easy to use
§  Scale very well
§  Data model may be too simple for most applications
§  Systems:
– Redis, Riak, Memcached, Amazon DynamoDB, Aerospike,
FoundationDB
§  Use when data model is very simple and scalability essential

Typical Use Cases
§  The data model is very simple!
– Actual data can be JSON
§  Session data
§  User preferences and profiles
§  Shopping cart
§  If another NoSQL store is good enough, you may want to skip
this and let a column or document store handle it
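
A minimal sketch of the session-data use case, assuming an in-memory dictionary stands in for the store. `ToyKeyValueStore`, the key `session:42`, and the sample values are hypothetical; a real store adds persistence and distribution.

```python
import json

class ToyKeyValueStore:
    """In-memory stand-in for a key-value store; values are opaque strings."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = ToyKeyValueStore()
# The store sees only a key and a string; the JSON structure is the app's concern.
store.put("session:42", json.dumps({"user": "alice", "cart": ["sku-1", "sku-2"]}))
session = json.loads(store.get("session:42"))
print(session["cart"])  # ['sku-1', 'sku-2']
store.delete("session:42")  # e.g. on logout
```

Every access goes through the key, so there is no way to ask "which sessions contain sku-1" without reading everything; that is the simplicity trade-off.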

Column-Family
§  "Column-family": similar to a table
– Table is sparse
§  Key → (Column:Value)*
§  Columns have names
§  Can be indexed
§  Can store complex data
– Denormalize!
§  Systems:
– Google BigTable, HBase, Cassandra, Amazon SimpleDB,
Hypertable
§  Use when scalability is essential
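
The Key → (Column:Value)* shape can be sketched as a dictionary of dictionaries. The table name `users`, the helper names, and the sample rows are hypothetical; real column-family stores add partitioning, persistence, and per-column timestamps.

```python
from collections import defaultdict

# Row key -> {column name -> value}; absent columns take no space (sparse).
users = defaultdict(dict)

def put(row_key, column, value):
    """Write one named column of one row."""
    users[row_key][column] = value

def get_row(row_key):
    """Return a copy of all columns stored for a row (empty if none)."""
    return dict(users.get(row_key, {}))

put("user:1", "name", "Alice")
put("user:1", "last_login", "2016-06-28")
put("user:2", "name", "Bob")
put("user:2", "country", "DE")

print(get_row("user:1"))  # {'name': 'Alice', 'last_login': '2016-06-28'}
print(get_row("user:2"))  # {'name': 'Bob', 'country': 'DE'}
```

The two rows carry different columns, which is what "table is sparse" means in practice.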

Typical Use Cases
§  High insert volume: logging
§  Real-time updates
§  Content management
§  Expiring content
§  Cross-datacenter replication
§  MapReduce analytics over stored data
§  You don't need conventional (ACID) transactions

Document Stores
§  JSON, BSON, XML
§  No schema
§  Indexes improve performance
§  Easy transition from RDBMS
§  Systems
– MongoDB, CouchDB, CouchBase
§  Use when data is in semi-structured form
§  Often seen in new Web applications
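
A small sketch of schema-free documents and of how an index helps lookups, using plain Python dictionaries. The documents, field names, and helper functions are hypothetical; this is not the query API of any particular document store.

```python
from collections import defaultdict

# Documents in one collection do not need to share the same fields.
documents = [
    {"_id": 1, "type": "book", "title": "Big Data Basics", "pages": 320},
    {"_id": 2, "type": "shirt", "title": "Logo Tee", "sizes": ["M", "L"]},
    {"_id": 3, "type": "book", "title": "NoSQL Notes", "pages": 180},
]

def build_index(docs, field):
    """Map each value of `field` to the ids of the documents that contain it."""
    index = defaultdict(list)
    for doc in docs:
        if field in doc:
            index[doc[field]].append(doc["_id"])
    return index

def find(docs, **criteria):
    """Full scan: return documents whose fields match all criteria."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

type_index = build_index(documents, "type")
books = find(documents, type="book")
print(type_index["book"])            # [1, 3]
print([d["title"] for d in books])   # ['Big Data Basics', 'NoSQL Notes']
```

The scan touches every document, while the index answers the same question with one dictionary lookup.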

Typical Use Cases
§  Logging
– Especially with variable content
§  Product information
§  Customer information
§  Content management
§  Data to be stored has format that varies over time
– Flexible schema
§  Web analytics

Graph Databases
§  Nodes with properties
§  Nodes connected through relationships
§  Can model very complex graph data
– Social networks
§  Systems:
– Neo4j, InfiniteGraph, TitanDB, OrientDB
§  Use when data is a (complex) graph
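
A minimal sketch of a relationship query on a small social graph, assuming adjacency sets in memory. The people and the function name are hypothetical; a graph database would express this as a traversal in its own query language.

```python
# Adjacency sets: person -> people they are connected to (hypothetical data).
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}

def friends_of_friends(graph, person):
    """People two hops away who are not already direct connections."""
    direct = graph[person]
    candidates = set()
    for friend in direct:
        candidates |= graph[friend]
    return sorted(candidates - direct - {person})

suggestions = friends_of_friends(friends, "ann")
print(suggestions)  # ['dan', 'eve']
```

In a relational store the same question needs a self-join per hop, which is where graph stores tend to pay off.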

Typical Use Cases
§  Highly interconnected data
§  Social graphs
§  Party relationships in an enterprise
§  Location based services
§  Purchasing analytics and recommendations
§  Often combined with other systems to store the bulk of data
– Graph database can focus on relationships

Integrating Relational, Streams, and Hadoop
[Figure: integration architecture; recoverable labels listed below]
§  Data sources
– Traditional / relational data sources
– Non-traditional / non-relational data sources (varied data formats: semi-structured, unstructured...)
§  Streams: in-motion analytics, ultra low latency results
§  Data + Big Data: data analytics, results
§  Traditional warehouse (database & warehouse): at-rest data analytics, results
§  Event system, NoSQL

Lambda Architecture
[Figure: incoming data feeds both the batch layer (master dataset, batch update) and the event (speed) layer (real-time data, real-time update, rolling values); the serving layer holds the batch view; queries merge the results]
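
The layers can be sketched in a few lines of Python, assuming page-view counting as the example workload. All names and the sample events are hypothetical; real deployments use separate systems for each layer.

```python
from collections import Counter

master_dataset = []        # batch layer: append-only record of all events
batch_view = Counter()     # serving layer: precomputed from the master dataset
realtime_view = Counter()  # speed layer: rolling counts since the last batch run

def ingest(event):
    """Incoming data goes to both the batch layer and the speed layer."""
    master_dataset.append(event)
    realtime_view[event] += 1

def run_batch():
    """Recompute the batch view from scratch, then reset the speed layer."""
    batch_view.clear()
    batch_view.update(master_dataset)
    realtime_view.clear()

def query(key):
    """Merge batch and real-time results at query time."""
    return batch_view[key] + realtime_view[key]

for page in ["home", "cart", "home"]:
    ingest(page)
run_batch()
ingest("home")        # arrives after the batch run
print(query("home"))  # 3: two from the batch view, one from the speed layer
```

The batch view can always be rebuilt from the master dataset, so errors in the speed layer are corrected by the next batch run.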

Master Data Management and Governance
§  Big Data and NoSQL stores can easily become a bigger mess
than relational stores
§  Introduce a practical plan
– Avoid lengthy and cumbersome governance
– Actual use should be the driving force
– Start slow
§  Be ready for change
– The technologies change rapidly
§  Focus on business outcomes

Succeeding with Big Data and NoSQL
1.  Actively look for solutions where the right store can ease the
pain
2.  Make sure you deliver tangible value to clients
3.  After you get your first apps to work: create a Big Data
introduction and governance plan
4.  Prioritize: do the most useful thing for the business first
5.  Integrate with existing IT
6.  Make sure you hire or grow your Big Data champions
7.  Field is immature: look out for new tools and techniques

Conclusions
– Hadoop and NoSQL address the weak points of relational
systems:
•  Scale
•  Performance
•  Unstructured and semistructured data
– Streaming addresses the processing of data in real time
– Integrate with conventional technologies!
– Spark and Flink: the next generation Big Data systems
