Big Data & Security Training Guide

This document provides a summary of the first day of a Big Data training course. It introduces key concepts like prerequisites for Big Data including data preparation, warehousing and mining. It also covers processing models like MapReduce and Big Data storage approaches like sharding and the CAP theorem. The document discusses introducing students to Big Data initiatives and research at a university focusing on areas like smart enterprise, infrastructure and society. It outlines some sample activities for the course around topics such as graph analytics, classification, identifying features and more.


Big Data Training

1st Day

Ernesto Damiani
First Day Outline
• Introduction
• Prerequisites
– Data preparation
– Data warehousing
– Data mining
• Processing models and MapReduce paradigm
• Big Data storage basics
– Sharding
– CAP Theorem
Introduction
BIG DATA INITIATIVE
Drive open research & innovation collaboration with UAE and international institutes and organisations to carry out world-leading research and deliver tangible value, training, knowledge transfer and skills development in line with the UAE strategic priorities in the areas of:
Smart enterprise, smart infrastructure & smart society

Security Research Center


SECURITY OF THE GLOBAL ICT INFRASTRUCTURE
Network and Communications Security
Business Process Security and Privacy
Security and Privacy of Big Data Platforms
SECURITY ASSURANCE
Security Risk Assessment and Metrics
Continuous Security Monitoring and Testing
DATA PROTECTION AND ENCRYPTION
High Performance Homomorphic Encryption
Lightweight Cryptography and Mutual Authentication
Some activities
Vision
• Big Data is not just a technological advance but represents a paradigm shift in extracting value from complex multi-party processes
From the classic data warehouse to Big Data
The Five Vs
Big Data: Volume

• Lots of data is being collected and warehoused
– Web data, e-commerce
– Bank/Credit Card transactions
– Social Networks
– Network-generated events (LTE, 4G, 5G)
Big Data: Variety in Data Models
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
– Social Network, Semantic Web (RDF), …
• Streaming Data
– You can only scan the data once
What to do with these data?
• Aggregation and Statistics
– Data warehouse and OLAP
• Indexing, Searching, and Querying
– Keyword based search
– Pattern matching (XML/RDF)
• Knowledge discovery
– The algorithms mentioned in the next slide
What Big Data Analytics can deliver:
Knowledge Discovery techniques
Graph Analytics
Classification
Identifying Features
How we deliver it:
Processing
Models

• Batch vs streaming
• Hash vs sketch
Processing models: a change in paradigm (1)
• Traditional data processing model: SWAPPING
– Data are read BLOCK by BLOCK
– Disk -> RAM -> cache -> register
• Unfeasible for current data sizes
– Google indexes 10 billion pages, 200 Tbyte
– Disk bandwidth: 50 Mbyte/sec (for contiguous reads!)
– Reading time > 40 days (!)
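A quick back-of-envelope check of the reading-time claim, as a Python sketch that uses only the figures quoted above:

corpus_bytes = 200 * 10**12          # 200 Tbyte, the index size quoted above
bandwidth = 50 * 10**6               # 50 Mbyte/sec, contiguous reads only
seconds = corpus_bytes / bandwidth   # 4,000,000 seconds of pure sequential I/O
print(seconds / 86400, "days")       # -> about 46 days, i.e. "> 40 days"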
A change in paradigm (2)
• From SWAPPING to CLUSTERING
– Data partitioned in CHUNKS, each going to a different node of the cluster
– Move computation to chunk sites (bring computations to data)
– 10 Tbytes, 1000 nodes -> 10 Gbytes per node -> can use SWAPPING locally
A change in paradigm (3)
• Problem: local servers may need to exchange data prior to processing
– Example: I want to compute the average driving time from/to all Emirates
• Chunks I have: each chunk contains records collected at arrival site A, showing trips from X to A (for all Xs)
• Chunks I need: each chunk containing records showing trips from X to A (for a given X)
– Shuffling is needed -> BUT I need to consider network latency
• Assume we move 10 Tbytes at 1 Gbyte/sec -> many hours
– Need to do shuffling at collection, NOT at computation time
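A similar sketch (again using only the figures above, and ignoring protocol overhead) explains the "many hours":

shuffle_bytes = 10 * 10**12                       # 10 Tbyte to move between chunk servers
print(shuffle_bytes / 10**9 / 3600, "hours")      # at 1 Gbyte/sec: ~2.8 hours of pure transfer
print(shuffle_bytes * 8 / 10**9 / 3600, "hours")  # at a more common 1 Gbit/sec: ~22 hours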
A change in paradigm (4)
• Even after I distribute computation to chunk servers, some may fail
• Node failure
– 1000 servers – 1 failure per day
– Use redundancy for chunk persistence and availability
– Build a redundant storage infrastructure -> multiple replicas of each chunk
Google GFS/HDFS
• Master-slave architecture
• Slave servers: contain large data files, 16/64 Gbytes each
• Master servers: contain metadata on file distribution
– Talk to the master to find the slave servers
• Google equation: slave servers = computation servers
• Avoid moving data
• Killer application: building the global index
After Google: a vision
• Big Data is not just a technological advance but represents a
paradigm shift in extracting value from complex multi-party
scenarios
– Coming up: European Big Data ecosystem of data owners,
transformers, distributors, and analyzers.
• In this ecosystem:
– Data owners/suppliers compete to provide high quality and
timely information. Each supplier has its own privacy and
confidentiality constraints, as well as its level of data
quality, trustworthiness and credibility.
– Data transformers compete to sanitize, prepare and
publish data sets and streams
– Data analyzers compete to collect the “best” data from
data suppliers and transformers and process them to
deliver quality suggestions to decision makers within the
allotted time.
Actors and Roles
• Roles: collector, transformer, user
• Actors:
– Over-The-Top (OTT) operators: global application-level players (Google, Amazon, ...) who collect application-level data and use them for value-added services
– Telcos (Vodafone, Etisalat, ...): transport providers who collect data at the lower stack levels and use them for value-added services (also for themselves)
– Cloud providers doing IaaS and SaaS (Amazon, Aruba, ...): collect data at all levels
Overlapping interests... different questions
TELCO QUESTIONS
• Traffic estimation
– How many bytes were sent between a pair of devices?
– What fraction of network IP addresses are active?
• Traffic analysis
– What is the average duration of a session?
– What is the median of the number of bytes in each session?
• Fraud detection
– List all sessions originating from location X that transmitted more than 1000 bytes
– Identify all sessions started in the last 20 seconds with duration more than twice normal
• Security/Denial of Service
– List all IP addresses that witnessed a sudden spike in traffic
– Identify IP addresses that were in more than 1000 sessions

OTT QUESTIONS
• Traffic estimation
– Apps having at least 5k downloads in the last 20 seconds
– List top 100 locations in terms of purchase transactions
• Traffic analysis
– What is the per-app duration of a user session?
– Median size of users’ personal cloud storage
• Fraud detection
– List all purchases made by user X in the last 20 minutes
– Identify all purchases whose amount was more than twice normal
• Security/Denial of Service
– List all services that have witnessed a sudden spike in usage
– Identify services involved in more than 1000 sessions
Sample Telco-style application:
Network Monitoring

Example NetFlow IP session data:

Source     Destination  Duration  Bytes  Protocol
10.1.0.2   16.2.3.7     12        20K    http
18.6.7.1   12.4.0.3     16        24K    http
13.9.4.3   11.6.8.2     15        20K    http
15.2.2.9   17.1.2.1     19        40K    http
12.4.3.8   14.8.7.4     26        58K    http
10.5.1.3   13.0.0.1     27        100K   ftp
11.1.0.6   10.3.4.5     32        300K   ftp
19.7.1.2   16.5.5.8     18        80K    ftp

[Diagram: SNMP/RMON and NetFlow records flow from the network elements (enterprise FR/ATM/IP VPN access, mobile networks, Voice over IP, PSTN, converged IP/MPLS core, peers) to the Network Operations Center (NOC).]

• 24x7 IP packet/flow data streams at network elements
• Truly massive streams arriving at rapid rates
– Vodafone collects ~1 Terabyte of NetFlow data each day
• Before: stored and shipped off-site to a data warehouse for off-line analysis
• Today: analysis must be done within the network
Analytics models: static vs. dynamic
Handling batch and streams together:
the lambda architecture
EBTIC Methodology and Reference
Architecture
Open issues
• Enforcement of privacy, trustworthiness and access control
– Fast data granulation techniques for bringing large volumes of data to a
granularity and detail level compatible with the privacy preferences and
non-disclosure requirements of the data owners.
– Fast data filtering techniques to bring data at the desired level of
trustworthiness and credibility, as well as enforcing data owners’ access
authorizations.
• Dynamically adaptable data representation
– Dynamically adaptable semantic enrichment, e.g. by adding references to
each other, to external data vocabularies, or to other reference sources.
– Quantitative improvement (1) by “purchasing” additional data (e.g.,
publicly available open data) to add new information (2) by processing
available information to make implicit facts explicit.
– Qualitative improvement by turning data into assertions drawn on external
formal vocabularies
• Data and analytics co-provisioning
– Dynamic provisioning of data and of analytics tailored to the data
representation and distribution.
Prerequisites
What Data?
• Case management data
– Audits / Dynamic processing
• CRM /ERP data
– SAP, Oracle, …
• Servicedesk ITIL data
– HP service manager
• Internet click STREAMS
• Netflow data
• IoT
– Device hardware events
Data quality
• Data are of high quality if
– they are fit for their intended uses in operations,
decision making and planning.
– they correctly represent the real-world construct
to which they refer
• As data volume increases, the question of
internal consistency within data becomes
paramount, regardless of fitness for use for
any external purpose (J. M. Juran)
Adapted from: http://en.wikipedia.org/wiki/Data_quality
Definitions of data quality
• Degree of excellence exhibited by the data in relation to the
portrayal of the actual scenario.
• The state of completeness, validity, consistency, timeliness
and accuracy that makes data appropriate for a specific
use.
• The totality of features and characteristics of data that
bears on their ability to satisfy a given purpose; the sum of
the degrees of excellence for factors related to data.
• The processes and technologies involved in ensuring the
conformance of data values to business requirements and
acceptance criteria.
• Complete, standards based, consistent, accurate and time
stamped.
Adapted from: http://en.wikipedia.org/wiki/Data_quality
Challenges for data quality in a process
analysis context
General:
• No history (just latest activity, current/last status) OR too
much history, very large datasets
IDs:
• No IDs, across multiple systems OR multiple IDs
(customer_id, update_id, session_id)
Time:
• Unrepresentative or wrong timestamps
Activity:
• System / automated activities vs. human-interaction activities
vs. unregistered activities
What is data preparation?
• Cleaning
• Integration
• Selection
• Transformation
Challenges for data preparation
Cleaning:
• Removing duplicates / correcting date-timestamps / ID
disambiguation
Integration:
• Merging data sources
– How to deal with large blobs full of multimedia/ free text?
Selection:
• Connecting IDs to follow end-to-end process
Transformation:
• Formatting: activities in columns, you lose loops and assume a
pre-specified process is followed
• Which environment is receiving the prepared data?
Sampling
• Population: the set or universe of all entities under study.
• Looking at the entire population may not be feasible, or may be too expensive.
• Instead, one draws a random sample from the population and computes appropriate statistics from the sample, which give estimates of the corresponding population parameters of interest.
What is a sample?
• Sampling is a general technique for tackling massive amounts
of data
• Example: To compute the median packet size of a stream of IP packets, we
just sample some and use the median of the sample as an estimate for the
true median. Statistical arguments relate the size of the sample to the
accuracy of the estimate.
• How big does the sample need to be?
– A poll to predict election outcomes could get by with no more than a
couple of thousand respondents to gauge the attitudes of millions of
voters.
– Even a massive data source with billions of rows may be just a small
portion of the data that could potentially be collected from the real
world (e.g., atmospheric temperature or pressure samples)
• How do we know how big a sample we need?
– Classical statistics has methods for that.
– The regression criterion: what you discover from your sample must hold when tested on additional data
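As a toy illustration of the packet-size example above, here is a minimal Python sketch; the packet sizes are synthetic (uniform between 40 and 1500 bytes), purely for illustration:

import random

random.seed(0)
stream = (random.randint(40, 1500) for _ in range(1_000_000))   # synthetic packet sizes, in bytes

sample = [size for size in stream if random.random() < 0.01]    # keep each packet with probability 1%
sample.sort()
print(sample[len(sample) // 2])   # sample median, close to the true median (~770 for this distribution)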
Randomness
• “Random sample” means that every case in the
population has an equal opportunity to get in the
sample.
– The most fundamental assumption of statistical
analysis is that samples are random
• If you use all the data in a dataset, you are not
really avoiding sampling.
– You will use your analysis to draw conclusions about future
cases – cases that are not in your data set today. So your Big
Data is still just a very, very big sample from the population that
matters to you
Random Sample and Statistics (II)
• Let Xi denote a random variable (e.g.: waiting time to
get a service) corresponding to data point xi (a timed
wait). Then, a statistic θ is a function θ : (X1, X2, · · · ,
Xn) → R.
• If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.
• More with Dr. Gabriel later
Not always trivial
• How do you sample a sliding window on a
stream of unknown length?
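A standard building block here is reservoir sampling, which maintains a uniform sample from a stream whose length is not known in advance; the sliding-window variants build on it. A minimal sketch in Python:

import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)      # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(100_000), 5))   # 5 items, each equally likely to be kept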
Food for thought:
Sampled data vs. full data
• Since data mining began to take hold in the
late nineties, “sampling” has been neglected
• The Big Data frenzy is leading many to
conclude that size always means more
predictive power and value. The more data
the better, the biggest analysis is the best.
• Is this really the case?
Sampled data vs. full data
• Two legitimate reasons for abandoning sampling:
1. The customer base of Big Data data mining tools is not trained in sampling techniques.
2. Going full data allows looking for extreme cases.
– For example, in intelligence or security applications, only a few cases out of millions may exhibit behavior indicative of threatening activity, so analysts in those fields have a good reason to go full data.
Disclaimer 1
• More is not necessarily better.
– Analyzing massive quantities of data consumes a lot of resources: computing power, storage space, and analyst effort.
– There is also the matter of data quality. It is easier to ensure that a modest-sized sample is cleaned up than a huge repository.
Disclaimer 2
• Sampling is not the only compression method we can use to make Big Data small.
– Summarization methods: hashing, sketching.
– More later.
Classical analytics on data sets
Warehouse Architecture

[Diagram: clients submit query & analysis requests to the data warehouse (described by metadata); the warehouse is loaded by a merge-integrate layer fed from multiple sources.]
Star Schemas
• A star schema is a common organization for data at a warehouse. It consists of:
1. Fact table: a very large accumulation of data points (example: individual waiting times for MRIs). Often “insert-only.”
2. Dimension tables: smaller, generally static information about the entities involved in the facts (example: patient’s data, location of the healthcare facility, type of MRI equipment requested)
Terms
• Fact table: sale (orderId, date, custId, prodId, storeId, qty, amt)
• Dimension tables: customer (custId, name, address, city), product (prodId, name, price), store (storeId, city)
• Measures: qty, amt
Star

product: prodId  name  price
         p1      bolt  10
         p2      nut   5

store: storeId  city
       c1       nyc
       c2       sfo
       c3       la

sale: orderId  date    custId  prodId  storeId  qty  amt
      o100     1/7/97  53      p1      c1       1    12
      o102     2/7/97  53      p2      c1       2    11
      o105     3/8/97  111     p1      c3       5    50

customer: custId  name   address    city
          53      joe    10 main    sfo
          81      fred   12 main    sfo
          111     sally  80 willow  la
Cube

Fact table view:
sale: prodId  storeId  amt
      p1      c1       12
      p2      c1       11
      p1      c3       50
      p2      c2       8

Multi-dimensional cube (dimensions = 2):
      c1  c2  c3
p1    12      50
p2    11   8
3-D Cube

Fact table view:
sale: prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

Multi-dimensional cube (dimensions = 3):
day 1:     c1  c2  c3
      p1   12      50
      p2   11   8
day 2:     c1  c2  c3
      p1   44   4
      p2
ROLAP vs. MOLAP
• ROLAP:
Relational On-Line Analytical Processing
• MOLAP:
Multi-Dimensional On-Line Analytical
Processing

Aggregate metrics
• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

sale: prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

Result: 81
Aggregate metrics
• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

sale: prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

ans: date  sum
     1     81
     2     48
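The same aggregation can be done outside a SQL engine; here is a sketch in Python/pandas on the toy sale table above (pandas is assumed to be available):

import pandas as pd

sale = pd.DataFrame(
    [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
     ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)],
    columns=["prodId", "storeId", "date", "amt"],
)

# Equivalent of: SELECT date, sum(amt) FROM SALE GROUP BY date
print(sale.groupby("date")["amt"].sum())   # date 1 -> 81, date 2 -> 48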
Another Example
• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

sale: prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

Result: prodId  date  amt
        p1      1     62
        p2      1     19
        p1      2     48

Going to a coarser grouping is a rollup; going to a finer one is a drill-down.
Aggregate metrics
• Operators: sum, count, max, min, median, avg
• “Having” clause
• Using dimension hierarchy
– average by region (within store)
– maximum by month (within date)
Data Mining
• Discovery of useful, possibly unexpected,
patterns in data
• Non-trivial extraction of implicit, previously
unknown and potentially useful information
from data
• Exploration & analysis, by automatic or
semi-automatic means, of large quantities of
data in order to discover meaningful patterns
Data Mining Tasks
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]
• Collaborative Filtering [Predictive]
Classification: Definition
• Given a collection of records (training set )
– Each record contains a set of attributes, one of the
attributes is the class.
• Find a model for class attribute as a function of
the values of other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to
build the model and test set used to validate it.
Decision Trees
Example:
• Conducted a survey to see which customers were interested in a new model car
• Want to select customers for an advertising campaign

Training set:
custId  car     age  city  newCar
c1      taurus  27   sf    yes
c2      van     35   la    yes
c3      van     40   sf    yes
c4      taurus  22   sf    yes
c5      merc    50   la    no
c6      taurus  25   la    no
Clustering

[Figure: customers plotted along income, education, and age; clusters appear as groups of nearby points.]
Association Rule Mining

Sales records (market-basket data):
tran1  cust33  p2, p5, p8
tran2  cust45  p5, p8, p11
tran3  cust12  p1, p9
tran4  cust40  p5, p8, p11
tran5  cust12  p2, p9
tran6  cust12  p9

• Trend: products p5, p8 often bought together
• Trend: customer 12 likes product p9
Repeated clustering
• Goal: predict what products/services… a customer may be
interested in, on the basis of
– Past preferences of the person
– Preferences of other people with similar past preferences
• Repeated clustering
– Cluster people based on preferences
– Cluster products liked by the same clusters of people
– Again cluster people based on their preferences for (the newly created
clusters of) products
– Repeat till no more changes occur

Other Types of Mining
• Text mining: application of data mining to textual
documents
– cluster Web pages to find related pages
– cluster pages a user has visited (visit history)
– automatically classify Web pages into a directory
• Graph Mining:
– Deal with graph data
– RDF-style models (more later)

Enter Big Data
• Classic analytics assume:
– Standard data models/formats
– Reasonable volumes
– Loose deadlines
• Problem: The five Vs jeopardise these
assumptions – (unless we sample or
summarize)
Scaling Up vs Scaling Out
• Issues with scaling up when the dataset is just too big
• RDBMS were not designed to be distributed
– Best way to provide ACID and a rich query model is to have
the dataset on a single machine. However, there are limits to
scaling up (Vertical Scaling).
– Past a certain point, an organization will find it is cheaper and
more feasible to scale out (horizontal scaling) by adding
smaller, more inexpensive (relatively) servers rather than
investing in a single larger server.
• A number of different approaches to scaling out
(Horizontal Scaling).
• Two approaches:
– Master-slave
– Sharding
Back to Processing Models
Map/Reduce Basics
• Map/Reduce is a programming model for
efficient distributed computing
• It works like a Unix pipeline:
– cat input | grep | sort | uniq -c | cat > output
– Input (GFS/HDFS) | Map | Shuffle & Sort | Reduce | Output (GFS)
– Map is called for each <key, value> pair in the GFS input
– No support for import and export from GFS/HDFS
Where does efficiency come from?
• Efficiency comes from:
– Streaming through data
• No disk seeks
• Disks are streamed to HDFS, then HDFS <key, value> pairs are mapped/reduced
– Pipelining
• Mapping is parallel
• A good fit for a lot of applications
– Web index building (Google!)
– Log processing
Example
• Need to build the histogram of English word frequencies, in all
Shakespeare plays
• Solution 1 (swapping): There is enough room in memory for the file of all
Shakespeare plays
– Load file from disk
– Scan file in memory and create <word, count> table
• Solution 2: (scanning) File too big for memory, but the list of all words
used by Shakespeare fits
– Create <word, counter> array (no duplicate words)
– Scan (not load!) file on disk and update counter when you find word
– Faster
• Solution 3 (chunking): Even the list of all words used by Shakespeare
does not fit
– Scan file on disk and extract <word, 1> entries (will get duplicate words)
– For each word, send its <word, 1> entries to a “chunk server”
– Compute <word, n> on each chunk server
– Even faster
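Solution 2 in miniature, as a Python sketch ("shakespeare.txt" is a hypothetical input file holding the full text of the plays):

from collections import Counter

counts = Counter()
with open("shakespeare.txt") as f:   # hypothetical file; scanned line by line, never loaded whole
    for line in f:
        counts.update(line.lower().split())
print(counts.most_common(3))         # the three most frequent words and their counts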
Map/Reduce Dataflow
• Input: list of <key,
value> pairs
• Map task: for each
<k,v> pair, gives a
set of <k’,v’>
• Reduce task: for
each set of <k’,v’>
computes an
integer x
Map-reduce computation (1)
GFS data: lines from ALL SHAKESPEARE’S PLAYS
• <1, «To be or not to be»>
• <2, «To die, to sleep … perhaps to dream»>
• …
Lines are mapped to MAP “line servers”. Starting from sets of «word, 1» entries, each node computes entries «word, k». The nodes for lines 1 and 2:

MAP1:      MAP2:
To 1       To 1
To 1       To 1
Be 1       To 1
Be 1       Sleep 1
Or 1       Perhaps 1
Not 1      Die 1
           Dream 1

The «To» entries form one key group («To» is now the key): MAP1 will compute «To, 2» and MAP2 will compute «To, 3».
Map-reduce computation (2)
Entries corresponding to a given key are sent to the REDUCE node.
• The REDUCE node receives the «To, k» data from the line servers:
To 2
To 3
• Received data with the same key are reduced to a histogram entry:
«To, 5»
MapReduce: Word Count Example (Pseudocode)

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

Example output: To 5, Be 2, Sleep 1, …

http://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0004.html
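The same dataflow can be simulated end to end in ordinary Python. This is a single-process toy sketch, not Hadoop/GFS code; punctuation is ignored and the two input lines are the ones from the earlier example:

from collections import defaultdict

lines = {1: "To be or not to be", 2: "To die to sleep perhaps to dream"}

# Map: emit a <word, 1> pair for every word of every line
intermediate = [(w.lower(), 1) for line in lines.values() for w in line.split()]

# Shuffle & sort: group the emitted values by key
groups = defaultdict(list)
for word, one in intermediate:
    groups[word].append(one)

# Reduce: sum the counts for each word
histogram = {word: sum(counts) for word, counts in groups.items()}
print(histogram["to"], histogram["be"], histogram["sleep"])   # -> 5 2 1, matching the slide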
Combiners
• Problem: too many sources
• Massive number of Reduce nodes
• Solution: do Reduce in two steps
– First: compute histogram for each single play (at
combiner nodes)
– Second: compute histogram of histograms (at final
reduce node)
– Function to be used is the same for Combiners
and for final Reduce
When it does not work
• Compute the average of word counts (rather
than histogram)
• With a single Reduce node: no problem
– Just divide each word counter by the total number
of words
• But: can you use combiners?
When it does not work
• No, you cannot recombine
• For a given word, the average word count across all Shakespeare plays is NOT the average of the per-play averages:

#(word_i in all plays) / #(words in all plays)  ≠  (1/N) · Σ_j [ #(word_i in play_j) / #(words in play_j) ]

• Example: (7 + 14) / (1000 + 1100) = 21/2100 = 0.0100, whereas (7/1000 + 14/1100) / 2 ≈ 0.0099
Combiners’ limitation
• The Reduce function must be associative (and commutative)
• Exercise 1: Find a way to compute the average word count with combiners (can be solved)
• Exercise 2: Find a way to compute the median word count with combiners (cannot be solved)
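One way to make the average combiner-friendly, shown below as a sketch of the usual trick of combining partial (sum, count) pairs; this is one possible answer to Exercise 1, not necessarily the intended one:

# Partial (sum, count) pairs CAN be combined: the merge below is associative and commutative.
def combine(a, b):
    return (a[0] + b[0], a[1] + b[1])

# Hypothetical per-play partials: (occurrences of word_i, total words) for two plays
partials = [(7, 1000), (14, 1100)]

total, words = (0, 0)
for p in partials:
    total, words = combine((total, words), p)

print(total / words)                                    # 21/2100 = 0.0100, the true overall frequency
print(sum(s / n for s, n in partials) / len(partials))  # ~0.0099, the average of averages (NOT the same)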
MapReduce Combiner
• Optional function “Combiner” (to optimize bandwidth)
– If defined, runs after Mapper & before Reducer on every node that has
run a map task
– Combiner receives as input all data emitted by the Mapper instances
on a given node
– Combiner output sent to the Reducers, instead of the output from the
Mappers
– Is a "mini-reduce" process which operates only on data generated by
one machine
• If a reduce function is both commutative and associative, then
it can be used as a Combiner as well
• Useful for word count – combine local counts

Source: https://developer.yahoo.com/hadoop/tutorial/module4.html
Concept of hashing
• The problem at hand is to define and implement a
mapping from a domain of keys to a domain of
locations
• From the performance standpoint, the goal is to
avoid collisions (A collision occurs when two or
more keys map to the same location)
• From the compactness standpoint, no application
ever stores all keys in a domain simultaneously
unless the size of the domain is small
Concept of hashing (2)
• The information to be retrieved is stored in a hash
table which is best thought of as an array of m
locations, called buckets
• The mapping between a key and a bucket is called
the hash function
• The time to store and retrieve data is proportional
to the time to compute the hash function
Hashing function
• The ideal function, termed a perfect hash function,
would distribute all elements across the buckets such
that no collisions ever occurred
• h(v) = f(v) mod m
• Knuth (1973) suggests using a prime number as the value for m
Hashing function (3)
• It is usually better to treat v as a sequence of bytes
and do one of the following for f(v):
(1) Sum or multiply all the bytes. Overflow can be
ignored
(2) Use the last (or middle) byte instead of the first
(3) Use the square of a few of the middle bytes
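A toy version of option (1), summing the bytes of the key and taking the result modulo a prime m as Knuth suggests; this is for illustration only, not a production hash function:

def toy_hash(key: str, m: int = 101) -> int:
    f = sum(key.encode("utf-8"))   # f(v): sum of all the bytes of the key (overflow is not an issue in Python)
    return f % m                   # h(v) = f(v) mod m, with m prime

for k in ["session_42", "session_43", "customer_12"]:
    print(k, "->", toy_hash(k))    # different keys usually land in different buckets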
Hash in MapReduce
• A typical default is to hash the key and use the
hash value modulo the number of reducers.
• It is important to pick a partition function that
gives an approximately uniform distribution of
data per shard for load-balancing purposes,
otherwise the MapReduce operation can be held
up waiting for slow reducers (reducers assigned
more than their share of data) to finish.
• We will resume hashing on the third day
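A minimal sketch of that default partitioning rule in plain Python (in Hadoop the equivalent role is played by the framework's hash partitioner):

def partition(key: str, num_reducers: int) -> int:
    # Send a key to reducer hash(key) mod num_reducers. Python randomizes string hashes
    # across processes, but the value is stable within one run, which is what matters here.
    return hash(key) % num_reducers

for word in ["to", "be", "or", "not", "to", "be"]:
    print(word, "->", partition(word, 4))   # repeated keys always go to the same reducer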
Example: Hash Join
• Read from two sets of reducer outputs that share the same hashing buckets
• One is used as a build set and the other as a probe set

[Diagram: splits feed mappers, which use a hash partitioner; each reducer reads from every mapper the data for its one designated partition; mergers then join the reducer outputs that share the same buckets.]
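An in-memory sketch of the build/probe idea for a single partition (toy Python, reusing the customer and sale names from the star-schema example earlier):

# Build side: the smaller input (customers); probe side: the larger input (sales).
customers = [(53, "joe"), (81, "fred"), (111, "sally")]           # (custId, name)
sales = [("o100", 53, 12), ("o102", 53, 11), ("o105", 111, 50)]   # (orderId, custId, amt)

# Build phase: hash the smaller input into an in-memory table keyed by custId
build_table = {}
for cust_id, name in customers:
    build_table.setdefault(cust_id, []).append(name)

# Probe phase: stream the larger input and emit matching rows
joined = [(order_id, name, amt)
          for order_id, cust_id, amt in sales
          for name in build_table.get(cust_id, [])]
print(joined)   # [('o100', 'joe', 12), ('o102', 'joe', 11), ('o105', 'sally', 50)]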


MapReduce performance gains are NOT automatic
• Your data go to an HDFS distributed file system.
– They will be horizontally partitioned (sharded)
• The initial location (entry shard) of a data point depends on where it enters the system
– In our Emirates commute example: the arrival Emirate of the commute trip
• Ideal situation: the stored data location matches the one required by the MapReduce mapping (e.g., all people starting from Abu Dhabi commute to the same place)
• This is seldom the case:
– You need internal (attribute-based) data routing to go from the initial sharding to the MAP/REDUCE map (move the commuter data to servers corresponding to starting points rather than arrivals)
The Latency Problem
• Data routing latency depends on
storage and network latencies
• Can kill overall performance
• Proactive data partitioning pays off
– but you must guess it right
Scaling storage: Sharding
• Vertical Partitioning: Have tables related to a specific feature sit on
their own server. May have to rebalance or “reshard” if tables
outgrow server.
• Range-Based Partitioning: When single table cannot sit on a server,
split table onto multiple servers. Split table based on some critical
value range.
• Key or Hash-Based Partitioning: Use a key value in a hash and use
the resulting value as entry into multiple servers.
• Directory-Based Partitioning: Have a lookup service that has knowledge of the partitioning scheme. This allows for adding servers or changing the partition scheme without changing the application.
• http://adam.heroku.com/past/2009/7/6/sql_databases_dont_scale
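A sketch of the directory-based variant, with hypothetical server names and key ranges (in practice the directory lives in a separate lookup service):

# Directory-based partitioning: a lookup table maps key ranges to servers, so shards
# can be added or rebalanced without changing application code.
DIRECTORY = {
    range(0, 1000): "server-A",     # hypothetical shard servers
    range(1000, 2000): "server-B",
    range(2000, 3000): "server-C",
}

def shard_for(customer_id: int) -> str:
    for key_range, server in DIRECTORY.items():
        if customer_id in key_range:   # O(1) membership test on a range
            return server
    raise KeyError(f"no shard covers id {customer_id}")

print(shard_for(42), shard_for(1500))   # -> server-A server-B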
The challenge
• Choose the partitioning strategy that will match your application’s requirements (i.e., its Map function)
– In the Emirates commute example, each time an arrival is collected, route it to the SOURCE Emirate chunk server
– The computation of the average commute to destinations for each source will then find the data already in place
• When in doubt, route multiple copies!
• Don’t be afraid of redundancy (but be aware of a possible conflict with privacy; more later)
Sharding SWOT
• Strengths:
– Scales well for both reads and writes
• Weaknesses:
– Not transparent; the application needs to be partition-aware
– Can no longer have relationships/joins across partitions
– Loss of referential integrity across shards
Other scale-out techniques
• Master/Slave
– Original GFS solution
– Multi-Master replication
– INSERT only, not UPDATES/DELETES
– No JOINs, thereby reducing query time
• In-memory databases (SAP HANA, ORACLE)
The NoSQL ecosystem
The sources
• Three major papers were the seeds of the NoSQL movement:
– BigTable (Google): http://labs.google.com/papers/bigtable.html
– Dynamo (Amazon): http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html and http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
– Amazon and consistency: http://www.allthingsdistributed.com/2010/02 and http://www.allthingsdistributed.com/2008/12
• Gossip protocol (discovery and error detection)
• Distributed key-value data store
• Eventual consistency
– CAP Theorem
ACID
Atomic: Either the whole process of a transaction is
done or none is.
Consistency: Database constraints (application-
specific) are preserved.
Isolation: It appears to the user as if only one process executes at a time. (Two concurrent transactions will not see one another’s changes while “in flight”.)
Durability: The updates made to the database in a
committed transaction will be visible to future
transactions. (Effects of a process do not get lost if
the system crashes.)
CAP Theorem
• Proposed fifteen years ago by Eric Brewer (talk at the Principles of Distributed Computing symposium, July 2000).
• Three properties of a system: consistency, availability and partition tolerance
• You can have at most two of these three properties for any shared-data system
• To scale out, you have to partition. That leaves either consistency or availability to choose from
– In almost all cases, you would choose availability over consistency
• Two kinds of consistency:
– strong consistency – ACID (Atomicity, Consistency, Isolation, Durability)
– weak consistency – BASE (Basically Available, Soft-state, Eventual consistency)
To put it simply
• Many nodes
• Nodes contain replicas of partitions of the data
• Consistency
– all replicas contain the same version of the data
• Availability
– the system remains operational on failing nodes
• Partition tolerance
– multiple entry points
– the system remains operational on a system split

[Diagram: triangle with vertices C, A and P]
CAP Theorem: satisfying all three at the same time is impossible
CAP properties explained
• Partitionability: divide nodes into small groups that can see other groups, but cannot see everyone.
• Consistency: if you write a value and then read it, you get the same value back. In a partitioned system there are windows where that is not true.
• Availability: you may not always be able to write or read; the system may refuse a write because it wants to keep the data consistent.
• To scale you have to partition, so you are left choosing either high consistency or high availability.
– Find the right overlap of availability and consistency.
Consistency Model
• A consistency model determines the rules for visibility and apparent order of updates.
• For example:
– Row R is replicated on nodes W and V
– Client A writes row R to node W
– Some period of time t elapses
– Client B reads row R from node V
– Does client B see the write from client A?
– For NoSQL, the answer is: “maybe”
• CAP Theorem again: strict consistency cannot be achieved at the same time as availability and partition tolerance.
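A toy single-process illustration of that “maybe” (a Python sketch; the sleep stands in for network propagation or anti-entropy, and is obviously not how a real store works):

import threading, time

node_W, node_V = {}, {}              # two replicas of the same row store

def replicate(key, value, delay):
    time.sleep(delay)                # stand-in for replication delay
    node_V[key] = value

node_W["R"] = "v1"                                               # client A writes row R to node W
threading.Thread(target=replicate, args=("R", "v1", 0.1)).start()

print(node_V.get("R"))   # client B reads too early -> None (A's write is not visible yet)
time.sleep(0.2)
print(node_V.get("R"))   # after propagation -> 'v1' (eventually consistent)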
Eventual Consistency
• When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent
• For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service
• As the data is written, the latest version is on at least one node. The data is then versioned/replicated to other nodes within the system. Eventually, the same version is on all nodes.
• Known as BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID

http://en.wikipedia.org/wiki/Eventual_consistency
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
From ACID to BASE
• Large systems based on CAP are not ACID, they are BASE (http://queue.acm.org/detail.cfm?id=1394128):
• Basically Available - system seems to work all the
time
• Soft State – not consistent all the time
• Eventually Consistent - becomes consistent at
some later time
• Today, everyone who builds big applications
builds them on CAP and BASE: Google, Yahoo,
Facebook, Amazon, eBay, etc
Advice
• Choose a specific approach based on the needs of the
service.
• Example: in an e-commerce checkout process you always want to honor requests to add items to a shopping cart because it is revenue-producing.
– In this case you choose high availability. Errors are hidden
from the customer and sorted out later.
• Instead: when a customer submits an order you favor
consistency
– Several services (credit card processing, shipping and
handling, reporting) are simultaneously accessing the data.
Schema-Less
Pros:
- Schema-less data model is richer than key/value
pairs
- eventual consistency
- many are distributed
- still provide excellent performance and scalability
Cons:
- Again, no ACID transactions or joins
Common Advantages
• Cheap, easy to implement (open source)
• Data are replicated to multiple nodes (therefore
identical and fault-tolerant) and can be
partitioned
– Down nodes easily replaced
– No single point of failure
• Easy to distribute
• No schema required
• Elasticity
• Relax the data consistency requirement (CAP)
What do we lose?
• Joins - group by - order by
• ACID transactions
• SQL as a sometimes frustrating but still
powerful query language
• easy integration with other applications
that support SQL
