Probabilistic Data Structures
and Approximate Solutions
IPython notebook with code >>

by Oleksandr Pryymak
PyData London 2014
Probabilistic||Approximate: Why?
Often:
● an approximate answer is sufficient
● need to trade accuracy for scalability or speed
● need to analyse a stream of data
Catch:
● despite typically good results, there is a
chance of bad worst-case behaviour.
● use on large datasets (law of large numbers)
Code: Approximation
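import random
# average() was not defined on the slide (the notebook probably got it from pylab);
# statistics.mean is used here as a stand-in
from statistics import mean as average

x = [random.randint(0, 80000) for _ in range(10000)]
y = [i >> 8 for i in x]  # trim 8 bits off of integers
z = x[:500]              # 5% sample (x is uniform)

avx = average(x)
avy = average(y) * 2**8  # add 8 bits
avz = average(z)
print(avx)
print(avy, 'error %.06f%%' % (100 * abs(avx - avy) / float(avx)))
print(avz, 'error %.06f%%' % (100 * abs(avx - avz) / float(avx)))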
39547.8816
39420.7744 error 0.321401%
39591.424 error 0.110100%
Code: Sampling Data

Interview question:
Get K samples from an infinite stream
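One common answer to this question is reservoir sampling. The sketch below is a minimal version under that assumption; the function name `reservoir_sample` and the `rng` parameter are illustrative and not taken from the talk's notebook. On a truly infinite stream the loop never ends, but the reservoir holds a uniform sample of everything seen so far at every step.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)       # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item     # keep the new item with probability k/(i+1)
    return reservoir

# Usage: the stream is consumed once and never stored in full.
sample = reservoir_sample(iter(range(100_000)), 5)
print(sample)
```

Memory use is O(K) regardless of how long the stream is.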
Probabilistic Data Structures
Generally they:
● Use less space than a full dataset
● Require higher CPU load
● Are stream-friendly
● Can be parallelized
● Have a controlled error rate
Hash functions
One-way function:
maps a key of arbitrary length
to a message of fixed length

message = hash(key)
However, collisions are possible:

hash(key1) = hash(key2)
Code: Hashing
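As a hedged illustration of both properties (fixed-length output, unavoidable collisions), here is a small sketch of the 32-bit FNV-1 hash. The function name `fnv1_32` and the bucket experiment are assumptions made for this example, not the code from the original slide.

```python
FNV_OFFSET_BASIS_32 = 2166136261
FNV_PRIME_32 = 16777619

def fnv1_32(data: bytes) -> int:
    """32-bit FNV-1: multiply by the FNV prime, then XOR in each byte."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h = (h * FNV_PRIME_32) % 2**32
        h ^= byte
    return h

# Keys of any length map to a value that fits in 32 bits.
digests = [fnv1_32(key) for key in (b"a", b"probabilistic", b"x" * 1000)]

# Squeezing 1000 keys into 256 buckets forces collisions (pigeonhole principle).
buckets = {}
for n in range(1000):
    buckets.setdefault(fnv1_32(str(n).encode()) % 256, []).append(n)
collisions = sum(len(keys) - 1 for keys in buckets.values())
print(digests, collisions)
```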
Hash collisions and performance
● Cryptographic hashes not ideal for our use (like bcrypt)
● Need a fast algorithm with the lowest number of collisions:

Hash           Lowercase              Random UUID       Numbers
=============  =====================  ================  ====================
Murmur         145 ns, 6 collis       259 ns, 5 collis  92 ns, 0 collis
FNV-1          184 ns, 1 collis       730 ns, 5 collis  92 ns, 0 collis
DJB2           156 ns, 7 collis       437 ns, 6 collis  93 ns, 0 collis
SDBM           148 ns, 4 collis       484 ns, 6 collis  90 ns, 0 collis
SuperFastHash  164 ns, 85 collis      344 ns, 4 collis  118 ns, 18742 collis
CRC32          250 ns, 2 collis       946 ns, 0 collis  130 ns, 0 collis
LoseLose       338 ns, 215178 collis  -                 -

Murmur2 collisions
● cataract collides with periti
● roquette collides with skivie
● shawl collides with stormbound
● dowlases collides with tramontane
● cricketings collides with twanger
● longans collides with whigs

by Ian Boyd: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
Hash randomness visualised (hashmap)
[Figure: "Great": murmur2 on a sequence of numbers; "Not so great": DJB2 on a sequence of numbers]
Comparison: Locality Sensitive Hashing (LSH)
Image hashes

Kernelized locality-sensitive hashing for scalable image search
B. Kulis, K. Grauman - Computer Vision, 2009 IEEE 12th …, 2009
Abstract: Fast retrieval methods are critical for large-scale and data-driven vision applications. Recent work has explored ways to embed high-dimensional features or complex distance functions into a low-dimensional Hamming space where items can be ...
Membership test: Bloom filter
A Bloom filter is probabilistic, but it only yields false positives, never false negatives.
Hash each item k times to get indices into a bit field (positions 1..m).
● At least one 0 means w definitely isn’t in the set.
● All 1s would mean w probably is in the set.
Use Bloom filter to serve requests
Code: bloom filter
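A minimal Bloom filter sketch, assuming k salted SHA-1 digests as the k hash functions. The class name `BloomFilter`, the parameters `m` and `k`, and the use of `hashlib` are choices made for this example; a production filter would more likely use a fast non-cryptographic hash such as Murmur and a packed bit array.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                  # size of the bit field
        self.k = k                  # number of hash functions
        self.bits = bytearray(m)    # one byte per bit, for readability

    def _indices(self, item):
        # Derive k indices by salting the item with the hash function number.
        for seed in range(self.k):
            digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1

    def __contains__(self, item):
        # Any 0 bit: definitely absent. All 1 bits: probably present.
        return all(self.bits[i] for i in self._indices(item))

bf = BloomFilter(m=10_000, k=4)
for n in range(100):
    bf.add(f"member-{n}")
print("member-7" in bf, "stranger" in bf)
```

Items that were added are always reported as present; items that were never added are occasionally reported as present, at a rate governed by m, k, and the number of stored items.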
Use Bloom filter to store graphs
Graphs only gain nodes because of Bloom
filter false positives.

Pell et al., PNAS 2012
Counting Distinct Elements
In:
infinite stream of data
Question: how many distinct elements are there?
is similar to:
In:
coin flips
Question: how many times has the coin been flipped?
Coin flips: intuition
● Long runs of HEADs in random series are rare.
● The longer you look, the more likely you are to see a long one.
● Long runs are very rare and are correlated with how
many coins you’ve flipped.
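The intuition can be checked with a short simulation. This is an illustrative sketch only: the helper `longest_run_of_heads` and the fixed seed are assumptions of the example. The longest run tends to grow roughly like the base-2 logarithm of the number of flips.

```python
import random

def longest_run_of_heads(flips):
    """Length of the longest consecutive run of True (heads) values."""
    longest = current = 0
    for is_head in flips:
        current = current + 1 if is_head else 0
        longest = max(longest, current)
    return longest

rng = random.Random(42)  # seeded so that reruns give the same sequence
for n in (10, 1_000, 100_000):
    flips = [rng.random() < 0.5 for _ in range(n)]
    print(n, longest_run_of_heads(flips))
```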
Code: Cardinality estimation
Cardinality estimation
Basic algorithm:
● n = 0
● For each input item:
  ○ Hash item into bit string
  ○ Count trailing zeroes in bit string
  ○ If this count > n:
    ■ Let n = count
● Estimated cardinality (“count distinct”) = 2^n
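The basic algorithm above can be sketched as follows. The 32-bit hash taken from SHA-1 and the function names `trailing_zeroes` and `estimate_cardinality` are assumptions of this example. The estimate is always a power of two and has high variance, which is what HyperLogLog later improves on.

```python
import hashlib

def trailing_zeroes(value, width=32):
    """Number of trailing zero bits in value (width if value is 0)."""
    if value == 0:
        return width
    return (value & -value).bit_length() - 1  # isolate the lowest set bit

def estimate_cardinality(items):
    n = 0
    for item in items:
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:4], "big")   # hash item into a 32-bit string
        n = max(n, trailing_zeroes(h))          # remember the longest run seen
    return 2 ** n

print(estimate_cardinality(range(1000)))
```

Duplicates hash to the same bit string, so repeating items in the stream does not change the estimate.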
Cardinality estimation: HyperLogLog

Demo by: http://www.aggregateknowledge.com/science/blog/hll.html
Billions of distinct values in 1.5KB of RAM with 2% relative error
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
P. Flajolet, É. Fusy, O. Gandouet, F. Meunier; 2007
Code: HyperLogLog
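A simplified HyperLogLog sketch, not the notebook's implementation. Assumptions: a 64-bit hash taken from SHA-1, `p` index bits giving m = 2^p registers (p of 7 or more, so the bias constant below applies), the position of the leftmost 1-bit as the per-register statistic, and only the small-range (linear counting) correction. The class name `HyperLogLog` is illustrative.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash
        idx = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # position of leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)               # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)             # small-range correction
        return raw

hll = HyperLogLog(p=10)
for i in range(10_000):
    hll.add(f"user-{i}")
estimate = hll.count()
print(round(estimate))
```

With m registers the typical relative error is about 1.04 / sqrt(m), so roughly 3% for p = 10.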
Count-min sketch
Frequency histogram estimation with a chance of over-counting

count(value) = min{w1[h1(value)], ..., wd[hd(value)]}
Code: Frequent Itemsets
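A minimal count-min sketch matching the formula count(value) = min over the d rows. The class name `CountMinSketch`, the `width` and `depth` parameters, and the salted SHA-1 hashing are assumptions of this example rather than the original notebook code.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=1000, depth=5):
        self.width = width      # counters per row
        self.depth = depth      # number of rows, one hash function each
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, value):
        # Salt the value with the row number to get d different hash functions.
        digest = hashlib.sha1(f"{row}:{value}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, value, count=1):
        for row in range(self.depth):
            self.rows[row][self._index(row, value)] += count

    def count(self, value):
        # Collisions can only inflate a counter, so the minimum is the best estimate.
        return min(self.rows[row][self._index(row, value)]
                   for row in range(self.depth))

cms = CountMinSketch()
for word in ["a"] * 5 + ["b"] * 2:
    cms.add(word)
print(cms.count("a"), cms.count("b"), cms.count("never seen"))
```

The estimate never falls below the true frequency; it can only exceed it when other values collide in every row.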
Machine Learning: Feature hashing
High-dimensional machine learning without feature dictionary

by Andrew Clegg “Approximate methods for scalable data mining”
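A hedged sketch of the hashing trick: each token is hashed straight to a vector index, so no feature dictionary is stored. The function name `hash_features`, the bucket count, and the sign bit taken from the same digest are assumptions of this example.

```python
import hashlib

def hash_features(tokens, n_buckets=16):
    """Map a list of string tokens to a fixed-size feature vector."""
    vector = [0.0] * n_buckets
    for token in tokens:
        digest = hashlib.sha1(token.encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h % n_buckets                     # bucket comes from the hash itself
        sign = 1.0 if (h >> 63) & 1 else -1.0     # signed updates reduce collision bias
        vector[index] += sign
    return vector

print(hash_features("the quick brown fox jumps over the lazy dog".split()))
```

Different tokens may share a bucket; that collision error is the price paid for bounded memory.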
Locality-sensitive hashing
To approximate nearest neighbours

by Andrew Clegg “Approximate methods for scalable data mining”
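One well-known LSH family for cosine similarity uses random hyperplanes: each signature bit records which side of a random hyperplane a vector falls on. The sketch below assumes that family; the helper names `make_hyperplanes`, `signature`, and `hamming` are illustrative, and the slide itself does not say which LSH scheme was shown.

```python
import random

def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector is on its non-negative side."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

planes = make_hyperplanes(n_bits=16, dim=3)
v = [1.0, 2.0, 3.0]
near = [1.1, 2.0, 2.9]
far = [-1.0, -2.0, -3.0]
print(hamming(signature(v, planes), signature(near, planes)),
      hamming(signature(v, planes), signature(far, planes)))
```

Vectors separated by a small angle agree on most bits, so candidate neighbours can be found by comparing short signatures instead of full vectors.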
Probabilistic Databases
● PrDB (University of Maryland)
● Orion (Purdue University)
● MayBMS (Cornell University)
● BlinkDB v0.1alpha (UC Berkeley and MIT)
BlinkDB: queries
Queries with Bounded Errors and Bounded Response Times on Very Large Data
BlinkDB: architecture
References
Mining of Massive Datasets
by Jure Leskovec, Anand Rajaraman, and Jeff Ullman
http://infolab.stanford.edu/~ullman/mmds.html
Summary

● know the data structures
● know what you sacrifice
● control errors

http://nbviewer.ipython.org/gist/235/d3ee622926b5f77f03df
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/ by Ilya Katsov
