6.897: CSCI8980 Algorithmic Techniques for Big Data September 5, 2013
Lecture 1
Dr. Barna Saha Scribe: Vivek Mishra
Overview
We introduce the data streaming model, in which elements arrive as a stream and main memory
is not large enough to hold all the data. We then look at the problem of finding frequent
items deterministically. This is a rare instance of a data streaming algorithm that provides a
non-trivial approximation guarantee deterministically. For most algorithms, as we will see, the
answer is approximate (close to optimal) and randomization is crucial. To emphasize the need
for randomization in designing data streaming algorithms, we next show that computing the number
of distinct items (known as $F_0$ computation) over a stream deterministically cannot achieve any
approximation without essentially storing the entire stream. Our next goal is to analyze the
algorithm for distinct items that was covered in Lecture 1. Towards that goal, we study some basic
concentration inequalities: the Markov inequality, the Chebyshev bound, and the Chernoff bound.
1 Introduction to Data Streams
In the data streaming model, a sequence of elements $a_1, a_2, a_3, \ldots, a_m$ arrives from a
domain $[1, n]$. Each element $a_i$ is a tuple $(j, \nu)$, where $j \in [1, n]$ is an element of
the domain and $\nu$ is an integer. For simplicity, we can consider $\nu = \pm 1$, where $+1$ means
$j$ is inserted into the stream and $-1$ means $j$ is deleted. The goal is to process these elements
using space that is ideally polylogarithmic in $m$ and $n$, and in any case sublinear in $m$ and $n$.
In the basic streaming setting, only a single pass over the data is allowed. The time to process
each update must be low.
When only insertions are allowed (all $\nu > 0$), the model is known as the cash-register model.
If both insertions and deletions are allowed, it is called the turnstile model. For any element,
we generally do not allow the number of deletions (total negative frequency) to exceed the total
number of insertions of that element. If we do allow such scenarios, we refer to the model as the
general turnstile model.
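The three models can be made concrete with a small sketch. The Python below is illustrative only: the function name `classify_stream` and the labels it returns are assumptions, not notation from the lecture, and the deletion condition is checked at every prefix of the stream, which is one reading of the definition above. It stores the whole frequency vector, which a real streaming algorithm could not afford.

```python
from collections import defaultdict


def classify_stream(updates):
    """Classify a stream of (j, nu) updates by the model it fits.

    Illustrative helper (not from the lecture): it keeps the full
    frequency vector, which a real streaming algorithm would avoid.
    """
    freq = defaultdict(int)      # running frequency of each item j
    has_deletion = False
    went_negative = False
    for j, nu in updates:
        freq[j] += nu
        if nu < 0:
            has_deletion = True
        if freq[j] < 0:          # more deletions than insertions so far
            went_negative = True
    if not has_deletion:
        return "cash-register"
    if went_negative:
        return "general turnstile"
    return "turnstile"
```

For example, the updates `[(1, 1), (1, -1), (2, 1)]` fit the turnstile model, while `[(1, -1), (1, 1)]` deletes item 1 before inserting it and so only fits the general turnstile model.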
2 Finding Frequent Items Deterministically
Here we describe an algorithm by Misra and Gries [1] to find frequent items in the cash-register
model, that is, when only insertions are allowed. The precise problem is as follows.
Given a sequence of $m$ elements from $[1, n]$ and any $k \in \mathbb{N}$, find all the elements with
frequency more than $m/k$, where the frequency of an element is the number of times it occurs, using
space $O(k)$ (in words) and only a single pass. In bits, the total space usage is $O(k \log n)$.
We would like to have the following guarantee.
• No false negatives. All items that have frequency $> m/k$ will be reported.
There might be false positives, that is, elements with frequency at most $m/k$ may be reported.
However, if we allow two passes, the second pass can compute the actual frequency of every element
reported in the first pass, and any false positive can be eliminated.
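The second pass is simple enough to sketch directly. The Python below is a minimal illustration, assuming the stream can be read again and that a candidate set has already been produced by some first pass with no false negatives; the name `second_pass` and its signature are hypothetical.

```python
from collections import Counter


def second_pass(stream, candidates, k):
    """Keep only candidates whose exact frequency exceeds m/k.

    `candidates` is assumed to come from a first pass with no false
    negatives. Only these items are counted, so the space used is
    proportional to the number of candidates.
    """
    wanted = set(candidates)
    counts = Counter()
    m = 0
    for item in stream:          # the second pass over the data
        m += 1
        if item in wanted:
            counts[item] += 1
    # counts[j] * k > m is the test f_j > m/k without floating point
    return {j for j in wanted if counts[j] * k > m}
```

With the stream `[1, 1, 1, 2, 3, 1, 2, 4]`, candidates `[1, 2]` and `k = 3`, only item 1 survives: its frequency 4 exceeds 8/3, while item 2 has frequency 2.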
Description of the Algorithm
• Data structure: An associative array of $k - 1$ cells, initialized to empty. Each cell holds a
key (the item) and a value (its count), and the array may be maintained as a balanced binary
search tree ordered by key. Whenever we see an element $a_i = (j, 1)$, we can check whether $j$
is in the array in $O(\log k)$ time.
• Procedure: On arrival of $a_i = (j, 1)$, we search for $j$ in the associative array. If $j$ is
already there, we increment its count. If it is not there and there is an empty cell, we store
it with count 1. If it is not there and no cell is empty, we do not store the element and
instead decrement the counters of all the stored elements by 1. If a counter reaches 0, that
element is dropped from the associative array and its cell becomes empty. In the following
pseudocode, $A$ denotes the associative array, and we give the update process when $a_i$ is
seen in the stream.
Algorithm 1 [Process a_i = (j, 1)]
  Search for j ∈ A
  if j is already present as key at cell A[l] then
    Count[l] = Count[l] + 1   {increment its count}
  else if j is not present in A and ∃l such that A[l] is empty then
    Insert j as key to A[l]
    Count[l] = 1
  else
    Drop j   {j is not present and there is no free space}
    for i = 1 to k − 1 do
      Count[i] = Count[i] − 1
      if Count[i] == 0 then
        Drop the key associated with A[i] and mark A[i] empty
      end if
    end for
  end if
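The update step above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: a built-in `dict` (a hash map) stands in for the balanced search tree, and the function names `process` and `misra_gries` are chosen here for illustration.

```python
def process(counters, j, k):
    """One update step for an inserted item j (cash-register model).

    `counters` is a dict mapping item -> count with at most k - 1 keys;
    a hash map is assumed here in place of the balanced search tree.
    """
    if j in counters:
        counters[j] += 1                 # j is stored: increment its count
    elif len(counters) < k - 1:
        counters[j] = 1                  # free cell: store j with count 1
    else:
        # j is not stored and there is no free cell: drop j and
        # decrement every stored counter, freeing cells that reach 0.
        for key in list(counters):
            counters[key] -= 1
            if counters[key] == 0:
                del counters[key]


def misra_gries(stream, k):
    """Single pass over `stream`; returns the surviving item -> count map."""
    counters = {}
    for j in stream:
        process(counters, j, k)
    return counters
```

On the stream `[1, 1, 1, 2, 3, 1, 2, 4]` with `k = 3` (two cells), the arrivals of 3 and 4 each trigger a decrement step, and the pass ends with the single entry `{1: 2}`. Item 1 has true frequency 4, which exceeds 8/3, so it is correctly retained even though its stored count underestimates the frequency.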
The entire algorithm for a given $k \in \mathbb{N}$ is as follows.
Algorithm 2 Misra-Gries Algorithm
  Initialize A of size k − 1 to empty
  for i = 1 to m do
    Process(a_i)
  end for
  return all the elements in A
2.1 Analysis
We now prove the following theorem.
Theorem 1. For any given $k \in \mathbb{N}$, Algorithm 2 returns all items with frequency more than $m/k$.
Proof. Clearly, the number of items having frequency more than $m/k$ is at most $k - 1$, because the
stream size is $m$. Let $f_j$ denote the actual frequency of item $j$, and let $\hat{f}_j$ be the
frequency of item $j$ as observed in $A$ at the end of the processing. If $j$ is not in $A$, we set
$\hat{f}_j = 0$. We therefore return all elements with $\hat{f}_j > 0$.
Note that we increment the count for $j$ only when we see the actual item. Hence $\hat{f}_j \le f_j$.
We want to find out how low it can be. If the count of $j$ is never decremented, then
$\hat{f}_j = f_j$. Otherwise, the count falls short either because at some point $j$ occurs while $A$
is full and $j$ is not stored, or because the count of $j$ is decremented while it is stored
(possibly down to 0, in which case $j$ is dropped from the array).
Note that whenever this happens, there are $k - 1$ other distinct elements whose counts are
decremented in the same step. Here we view the count of an arriving element as being increased to 1
and then dropped to 0 if it cannot be stored. Each such step therefore decrements the counts of $k$
distinct elements at one shot. Since the stream size is $m$, counts are incremented $m$ times in
total, and all counts are non-negative, so there can be at most $m/k$ such steps.
Each of these steps can increase the difference between the actual frequency and the computed
frequency by at most 1, and hence altogether we have $\hat{f}_j \ge f_j - m/k$ for all $j \in [1, n]$.
Therefore, for all $j \in [1, n]$, if $f_j > m/k$ then $\hat{f}_j > 0$, and hence $j$ is stored in the
array at the end and will be reported.
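The two-sided guarantee from the proof, $f_j - m/k \le \hat{f}_j \le f_j$, can be checked empirically. The sketch below is only a sanity check under assumptions: it re-implements the update rule compactly with a `dict`, and the test stream produced by `skewed_stream` is a hypothetical example, not data from the lecture.

```python
import random
from collections import Counter


def misra_gries_counts(stream, k):
    """Compact Misra-Gries pass with at most k - 1 counters (a sketch)."""
    counters = {}
    for j in stream:
        if j in counters:
            counters[j] += 1
        elif len(counters) < k - 1:
            counters[j] = 1
        else:
            # decrement all stored counters; keep only those still positive
            counters = {x: c - 1 for x, c in counters.items() if c > 1}
    return counters


def check_guarantee(stream, k):
    """Return True if f_j - m/k <= fhat_j <= f_j holds for every item j."""
    m = len(stream)
    true_freq = Counter(stream)
    est = misra_gries_counts(stream, k)
    return all(
        true_freq[j] - m / k <= est.get(j, 0) <= true_freq[j]
        for j in true_freq
    )


def skewed_stream(m, seed=0):
    """Hypothetical test stream: item 1 is frequent, the rest are spread out."""
    rng = random.Random(seed)
    return [1 if rng.random() < 0.4 else rng.randint(2, 50) for _ in range(m)]
```

Calling `check_guarantee(skewed_stream(2000), 5)` compares the estimates against exact counts for every item; since the bound is a theorem, the check should pass for any stream and any `k`.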
For most data streaming problems, one cannot achieve any non-trivial approximation deterministically.
In the following section, we come back to counting distinct items and show that no deterministic
algorithm using $o(m)$ space can give an exact count of the distinct elements.
3 Lower Bound for Deterministic Computation of Distinct Elements
We prove the following theorem here.
Theorem 2. There exists no deterministic algorithm that returns the exact count of distinct items
in a stream of size $m$ using $o(m)$ bits.
Proof. We prove this by contradiction. Suppose it is possible to compute the exact number of
distinct elements using $o(m)$ bits of space, and let $R$ be such an algorithm. Since $R$ uses $o(m)$
bits, the number of different configurations that $R$ can maintain is at most $2^{o(m)}$. We let
$n = m$ and consider all the different streams which have exactly $m/2$ distinct elements (one stream
of length $m/2$ for each choice of the $m/2$ elements). How many such streams are possible? Clearly,
the number of such streams is at least
$$\binom{m}{m/2} \approx \left(\frac{me}{m/2}\right)^{m/2} = (2e)^{m/2} = 2^{\Theta(m)},$$
where the approximation comes from Stirling's formula.
Since the number of such streams is more than the number of available configurations of $R$, there
must exist two streams $y$, $y'$, $y \neq y'$, such that both lead to the same configuration, that is,
$R(y) = R(y')$, and the distinct items in $y$ are not identical to the distinct items of $y'$. We now
consider two streams $Y_1 = y + y$ and $Y_2 = y' + y$, where $+$ represents concatenation, that is,
in stream $Y_1$, $y$ is followed by $y$, and in stream $Y_2$, $y'$ is followed by $y$. Since
$R(y) = R(y')$ and $R$ is deterministic, we must have $R(y + y) = R(y' + y)$. Therefore $R$ returns
the same count of distinct elements for both $Y_1$ and $Y_2$, which is wrong, because the number of
distinct elements in $Y_1$ is $m/2$ whereas for $Y_2$ it is $> m/2$.
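The pigeonhole step in the proof can be watched in action on a toy example. The sketch below is purely illustrative: the "algorithm" `run` is a made-up deterministic summary with only `B` bits of state (a bitmap of each item modulo `B`), chosen here as an assumption so that it has fewer configurations than there are $(m/2)$-element sets. The proof itself applies to any deterministic algorithm with too few configurations.

```python
from itertools import combinations

B = 4  # bits of state kept by the toy algorithm (hypothetical choice)


def run(stream, state=0):
    """Toy deterministic algorithm R: a B-bit bitmap of item mod B."""
    for x in stream:
        state |= 1 << (x % B)
    return state


def find_collision(m):
    """Find two different (m/2)-element sets that R cannot tell apart."""
    seen = {}
    for y in combinations(range(1, m + 1), m // 2):
        s = run(y)
        if s in seen:
            return seen[s], y
        seen[s] = y
    return None


def fooling_pair(m):
    """Build the two streams from the proof: y followed by y, and y' followed by y."""
    y, y_prime = find_collision(m)
    Y1 = y + y            # exactly m/2 distinct elements
    Y2 = y_prime + y      # more than m/2 distinct elements
    return Y1, Y2
```

For `m = 8` there are 70 sets of size 4 but only 16 possible states, so a collision must exist. The two streams returned by `fooling_pair(8)` leave `run` in the same final state, so any answer computed from that state is the same for both, although one stream has 4 distinct elements and the other has more.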
The above proof can be extended to show that there does not exist any deterministic algorithm
with space $o(m)$ that returns a count of distinct items within a multiplicative factor less than 2.
Similarly, one can show that no exact randomized algorithm can exist in $o(m)$ space either.
Surprisingly, when we allow both approximation and randomization, the space usage can be
drastically reduced (next lecture).
4 Basic Concentration Inequalities
Here we study three basic concentration inequalities, which bound the deviation of a random
variable from its expectation.
1. Markov inequality (the 1st moment inequality)
2. Chebyshev inequality (the 2nd moment inequality)
3. The Chernoff bound
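Theorem 3 (Markov Bound). For any non-negative random variable $X$ and any $t > 0$,
$$\Pr(X \ge t) \le \frac{E[X]}{t}. \tag{1}$$
Proof.
$$\begin{aligned}
E[X] &= \sum_{x} x \cdot \Pr(X = x) \\
&= \sum_{x < t} x \cdot \Pr(X = x) + \sum_{x \ge t} x \cdot \Pr(X = x) \\
&\ge 0 + t \cdot \sum_{x \ge t} \Pr(X = x) \\
&= t \cdot \Pr(X \ge t).
\end{aligned}$$
Dividing both sides by $t$ gives the bound.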
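Theorem 4 (Chebyshev Inequality). For any random variable $X$ and any $t > 0$,
$$\Pr(|X - E[X]| \ge t) \le \frac{\mathrm{Var}(X)}{t^2}. \tag{2}$$
Proof.
$$\begin{aligned}
\Pr(|X - E[X]| \ge t) &= \Pr\left((X - E[X])^2 \ge t^2\right) \\
&\le \frac{E\left[(X - E[X])^2\right]}{t^2} \\
&= \frac{\mathrm{Var}(X)}{t^2},
\end{aligned}$$
where the inequality is the Markov bound applied to the non-negative random variable $(X - E[X])^2$.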
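To get a feel for how loose the first two bounds can be, they can be compared with an exact tail probability. The sketch below assumes a Binomial$(n, p)$ variable as the test case; that choice, the function names, and the default parameters are illustrative assumptions rather than part of the lecture.

```python
from math import comb


def binomial_pmf(n, p):
    """Exact pmf of X ~ Binomial(n, p) as a list indexed by the value of X."""
    return [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]


def markov_bound(mean, t):
    """Markov: Pr(X >= t) <= E[X] / t for non-negative X and t > 0."""
    return mean / t


def chebyshev_bound(var, t):
    """Chebyshev: Pr(|X - E[X]| >= t) <= Var(X) / t^2 for t > 0."""
    return var / t**2


def compare(n=100, p=0.5, t=20):
    """Exact tails of Binomial(n, p) next to the two bounds."""
    pmf = binomial_pmf(n, p)
    mean, var = n * p, n * p * (1 - p)
    upper_tail = sum(pmf[x] for x in range(n + 1) if x >= mean + t)
    two_sided = sum(pmf[x] for x in range(n + 1) if abs(x - mean) >= t)
    return {
        "exact Pr(X >= mean + t)": upper_tail,
        "Markov bound": markov_bound(mean, mean + t),
        "exact Pr(|X - mean| >= t)": two_sided,
        "Chebyshev bound": chebyshev_bound(var, t),
    }
```

With the defaults, the Markov bound on $\Pr(X \ge 70)$ is $50/70$ and the Chebyshev bound on $\Pr(|X - 50| \ge 20)$ is $25/400 = 0.0625$; the exact probabilities returned by `compare()` are far smaller than either.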
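Theorem 5 (The Chernoff Bound). Let $X_1, X_2, \ldots, X_n$ be $n$ independent Bernoulli random
variables with $\Pr(X_i = 1) = p_i$. Let $X = \sum_i X_i$. Hence,
$$E[X] = E\left[\sum_i X_i\right] = \sum_i E[X_i] = \sum_i \Pr(X_i = 1) = \sum_i p_i = \mu \text{ (say)}.$$
Then the Chernoff bound says that for any $\epsilon > 0$,
$$\Pr(X > (1 + \epsilon)\mu) \le \left(\frac{e^{\epsilon}}{(1 + \epsilon)^{1 + \epsilon}}\right)^{\mu}$$
and, for any $0 < \epsilon < 1$,
$$\Pr(X < (1 - \epsilon)\mu) \le \left(\frac{e^{-\epsilon}}{(1 - \epsilon)^{1 - \epsilon}}\right)^{\mu}.$$
When $0 < \epsilon < 1$, the above expressions can be further simplified to
$$\Pr(X > (1 + \epsilon)\mu) \le e^{-\mu\epsilon^2/3} \quad\text{and}\quad \Pr(X < (1 - \epsilon)\mu) \le e^{-\mu\epsilon^2/2}.$$
Hence
$$\Pr(|X - \mu| > \epsilon\mu) \le 2e^{-\mu\epsilon^2/3}.$$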
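The Chernoff bounds can likewise be evaluated numerically and compared with exact binomial tails. This is a minimal sketch, assuming identical $p_i = p$ so that $X$ is Binomial$(n, p)$; the function names and the symbol `eps` for the deviation parameter are choices made here for illustration.

```python
from math import comb, exp


def chernoff_upper(mu, eps):
    """General form: Pr(X > (1+eps) mu) <= (e^eps / (1+eps)^(1+eps))^mu."""
    return (exp(eps) / (1 + eps) ** (1 + eps)) ** mu


def chernoff_upper_simple(mu, eps):
    """Simplified upper-tail form for 0 < eps < 1: exp(-mu eps^2 / 3)."""
    return exp(-mu * eps**2 / 3)


def chernoff_lower_simple(mu, eps):
    """Simplified lower-tail form for 0 < eps < 1: exp(-mu eps^2 / 2)."""
    return exp(-mu * eps**2 / 2)


def exact_tails(n, p, eps):
    """Exact Pr(X > (1+eps) mu) and Pr(X < (1-eps) mu) for Binomial(n, p)."""
    mu = n * p
    pmf = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]
    upper = sum(q for x, q in enumerate(pmf) if x > (1 + eps) * mu)
    lower = sum(q for x, q in enumerate(pmf) if x < (1 - eps) * mu)
    return upper, lower
```

For example, with `n = 100`, `p = 0.5` and `eps = 0.4` (so $\mu = 50$), the exact tails from `exact_tails` sit below the general bound, which in turn sits below the simplified form. Unlike the Markov and Chebyshev bounds, these bounds decay exponentially in $\mu$.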
References
[1] Jayadev Misra and David Gries. Finding repeated elements. Sci. Comput. Program.,
2(2):143-152, 1982.