International Journal of Science and Research (IJSR), India Online ISSN: 2319-7064
Volume 2 Issue 9, September 2013
www.ijsr.net
Minimize Staleness and Stretch in Streaming Data
Warehouses
S. M. Subhani1, M. Nagendramma2
1, 2 Department of CSE, BVSREC, Chimakurthy, A.P, India
Abstract: We study scheduling algorithms for loading data feeds into real time data warehouses, which are used in applications such
as IP network monitoring, online financial trading, and credit card fraud detection. In these applications, the warehouse collects a
large number of streaming data feeds that are generated by external sources and arrive asynchronously. We discuss update scheduling
in streaming data warehouses, which combine the features of traditional data warehouses and data stream systems. In our setting,
external sources push append-only data streams into the warehouse with a wide range of inter-arrival times. While traditional data
warehouses are typically refreshed during downtimes, streaming warehouses are updated as new data arrive. In this paper we
develop a theory of temporal consistency for stream warehouses that allows for multiple consistency levels. We model the
streaming warehouse update problem as a scheduling problem, where jobs correspond to processes that load new data into tables,
and whose objective is to minimize data staleness over time.
Keywords: Data warehouse maintenance, online scheduling
1. Introduction
The goal of a streaming warehouse is to propagate new
data across all the relevant tables and views as quickly as
possible. Once new data are loaded, the applications and
triggers defined on the warehouse can take immediate
action. This allows businesses to make decisions in nearly
real time, which may lead to increased profits, improved
customer satisfaction, and prevention of serious problems
that could develop if no action was taken.
Data warehouses integrate information from multiple
operational databases to enable complex business analyses.
In traditional applications, warehouses are updated
periodically and data analysis is done off-line [3]. In
contrast, real time warehouses [1], also known as active
warehouses [4], continually load incoming data feeds to
support time-critical analyses. For instance, an Internet
Service Provider (ISP) may collect streams of network
configuration and performance data generated by remote
sources in nearly real time. New data must be loaded in a
timely manner and correlated against historical data to
quickly identify network anomalies, denial-of-service
attacks, and inconsistencies among protocol layers.
Similarly, online stock trading applications may discover
profit opportunities by comparing recent transactions in
nearly real time against historical trends. Banks may be
interested in analyzing incoming streams of credit card
transactions to protect customers against identity theft.
Since the effectiveness of a real time warehouse depends on
its ability to ingest new data, we study problems related to
data staleness. In our setting, each table in the warehouse
collects data from an external source. The arrival of a set of
new data releases an update that seeks to append the data to
the corresponding table. Since existing data are not
modified, the processing time of an update is at most
proportional to the amount of new data.
Our first objective is to nonpreemptively schedule the
updates on one or more processors in a way that minimizes
the total staleness of all tables. Our first contribution
answers a question implicit in [2] regarding the difficulty of
this problem. We prove that even in the purely online
model, any online nonpreemptive algorithm achieves
staleness at most a constant factor times optimal, provided
that no processor is ever voluntarily idle and provided that
the processors are sufficiently fast.
2. System Model
2.1 Warehousing Architecture
Figure 1 illustrates a streaming data warehouse. Each data
stream is generated by an external source, with a batch of
new data, consisting of one or more records, being pushed to
the warehouse with period $p_i$. If the period of a stream is
unknown or unpredictable, we let the user choose a period
with which the warehouse should check for new data.
Examples of streams collected by an Internet Service
Provider include router performance statistics such as CPU
usage, system logs, routing table updates, and link layer
alerts. An important property of the data streams in our
motivating applications is that they are append-only, i.e.,
existing records are never modified or deleted. For example,
a stream of average router CPU utilization measurements
may consist of records with fields (timestamp, router name,
CPU utilization), and a new data file with updated CPU
measurements for each router may arrive at the warehouse
every five minutes [5].
Figure 1: Stream data warehouse
A streaming data warehouse maintains two types of tables:
base and derived. Each table may be stored partially or
wholly on disk. A base table is loaded directly from a data
stream. A derived table is a materialized view defined over
one or more tables. Each base or derived table $T_j$ has a
user-defined priority $p_j$ and a time-dependent staleness
function $s_j(\tau)$ that will be defined shortly. Relationships
among source and derived tables form a (directed and acyclic)
dependency graph. For each table $T_j$, we define a set of its
ancestor tables as those which directly or indirectly serve as
its sources, and a set of its dependent tables as those which
are directly or indirectly sourced from $T_j$. For example, $T_1$,
$T_2$ and $T_3$ are ancestors of $T_4$, and $T_3$ and $T_4$ are
dependents of $T_1$. In practice, warehouse tables are horizontally
partitioned by time so that only a small number of recent
partitions are affected by updates [6][7].
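The ancestor and dependent sets can be computed by a simple traversal of the dependency graph. The following is a minimal sketch, not the authors' implementation. The edge set in `SOURCES` is an assumption chosen only so that it agrees with the example in the text (the actual edges of Figure 1 are not recoverable here), and the names `SOURCES`, `ancestors` and `dependents` are hypothetical.

```python
# Assumed dependency graph: each table maps to the tables it is directly
# sourced from. Base tables (T1, T2) have no sources.
SOURCES = {
    "T1": [],
    "T2": [],
    "T3": ["T1"],
    "T4": ["T2", "T3"],
}


def ancestors(table, sources):
    """Return every table that directly or indirectly feeds `table`."""
    seen = set()
    stack = list(sources[table])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(sources[current])
    return seen


def dependents(table, sources):
    """Return every table that is directly or indirectly sourced from `table`."""
    return {t for t in sources if table in ancestors(t, sources)}
```

With this assumed graph, `ancestors("T4", SOURCES)` contains T1, T2 and T3, and `dependents("T1", SOURCES)` contains T3 and T4, matching the example above.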
2.2 Earliest Deadline First (EDF)
EDF has been proven to be an optimal uniprocessor
scheduling algorithm. This means that if a set of tasks is
unschedulable under EDF, then no other scheduling
algorithm can feasibly schedule this task set. At each
instant, the EDF algorithm chooses for execution the
currently active job(s) that have the nearest deadlines. The
EDF implementation upon uniform parallel machines follows
three rules: no processor is idled while there are active jobs
waiting for execution; when fewer than m jobs are active,
they are required to execute on the fastest processors while
the slowest are idled; and higher-priority jobs are executed
on faster processors. In Earliest Deadline First scheduling,
at every scheduling point the task having the earliest
deadline is taken up for scheduling. A set of periodic tasks
with deadlines equal to their periods is schedulable under
preemptive uniprocessor EDF if and only if the total
processor utilization due to the task set (the sum of the
per-task utilizations $U_i$) is at most 1.
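The selection rule and the utilization test can be illustrated in a few lines. This is a minimal sketch under stated assumptions: tasks are given as hypothetical (execution time, period) pairs with deadlines equal to periods, jobs as (name, absolute deadline) pairs, and the function names are illustrative only. Exact fractions are used so that a utilization of exactly 1 is not misjudged by floating-point rounding.

```python
from fractions import Fraction


def total_utilization(tasks):
    """Sum execution_time / period over (execution_time, period) pairs."""
    return sum(Fraction(c) / Fraction(p) for c, p in tasks)


def edf_schedulable(tasks):
    """Utilization test for preemptive uniprocessor EDF with deadline = period."""
    return total_utilization(tasks) <= 1


def edf_pick(active_jobs):
    """Return the active (name, absolute_deadline) job with the earliest deadline."""
    if not active_jobs:
        return None
    return min(active_jobs, key=lambda job: job[1])
```

For instance, the task set `[(1, 4), (2, 8), (1, 2)]` has utilization 1/4 + 1/4 + 1/2 = 1 and passes the test, while `[(3, 4), (1, 2)]` has utilization 5/4 and fails it.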
The aim of this work is to provide a sensitivity analysis for
task deadlines in the context of multiprocessor systems by
using a new approach, the EFDF (Earliest Feasible Deadline
First) algorithm. In order to decrease the number of
migrations, we prevent a job from moving from one processor
to another if it is among the m higher-priority jobs.
Therefore, a job will continue its execution on the same
processor if possible (processor affinity). Comparing
partitioned and global schemes outlines some situations
where one scheme is preferable over the other. Partitioning
schemes are better suited for hard real-time systems, while
a global scheme is preferable for soft real-time systems.
The final EDF-partitioned scheduling algorithm is as follows:
1. Sort the released jobs by the local algorithm.
2. For each job $j_i$ in sorted order:
a) If $j_i$'s home track is available, schedule $j_i$ on its
home track.
b) Else, if there is an available free track, schedule $j_i$ on
the free track.
c) Else, scan through the tracks r such that $j_i$ can be
promoted to track r:
i) If track r is free and there is no released job
remaining in the sorted list for home track r,
ii) schedule $j_i$ on track r.
d) Else, delay the execution of $j_i$.
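The steps above can be sketched as a single pass over the sorted jobs. This is an illustrative reading of the listed steps, not the authors' code. The `Job` record, its fields `home_track` and `promotable_to`, the `free_pool` of shared tracks, and the choice of the deadline as the local sort key are all assumptions introduced here. "Available" and "free" are both interpreted as "currently idle".

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    home_track: int
    deadline: float
    promotable_to: tuple = ()  # tracks this job may be promoted to (assumed field)


def assign_tracks(released, idle_tracks, free_pool, local_key=lambda j: j.deadline):
    """Return (schedule, delayed): a job-name -> track map and the delayed job names."""
    idle = set(idle_tracks)
    pending = sorted(released, key=local_key)  # step 1: local algorithm (EDF assumed)
    schedule, delayed = {}, []
    for pos, job in enumerate(pending):  # step 2
        # Home tracks still needed by jobs later in the sorted list.
        remaining_homes = {j.home_track for j in pending[pos + 1:]}
        idle_free = sorted(t for t in free_pool if t in idle)
        if job.home_track in idle:  # 2a: home track available
            track = job.home_track
        elif idle_free:  # 2b: a shared free track is available
            track = idle_free[0]
        else:  # 2c: promote only if no waiting job calls track r home
            track = next(
                (r for r in job.promotable_to
                 if r in idle and r not in remaining_homes),
                None,
            )
        if track is None:  # 2d: delay the job
            delayed.append(job.name)
        else:
            schedule[job.name] = track
            idle.discard(track)
    return schedule, delayed
```

Checking `remaining_homes` before a promotion is what reserves a track for the jobs that call it home: a job is promoted to track r only when no released job still waiting in the sorted list has r as its home track.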
3. Minimizing Staleness
We call an algorithm eager, or work-conserving, if it leaves
no processor idle while at least one pending update exists.
We first state the rather inscrutable Theorem 3.1, followed
by an easy-to-read corollary, which implies that for any
$C < (\sqrt{3} - 1)/2$, there is a constant (dependent on $C$)
such that the staleness of any eager algorithm is at most that
constant factor times optimal, provided that each $\alpha_i$ is
at most $C p/t$.
Theorem 3.1. Fix $p$, $t$. For any $\beta$ and $\delta$ such that
$0 < \beta, \delta < 1$, define
$C_{\beta,\delta} = \sqrt{\delta}\,(1 - \beta)/\sqrt{3} > 0$.
Given $p$ processors and $t$ tables, pick any $\alpha$ such that
$\alpha / [1 - \alpha/(p/t)] \le C_{\beta,\delta}\, p/t$. Then
the penalty incurred by an eager algorithm is at most
$(1 + \alpha)^2 (1/\beta^4) (1/(1 - \delta))$ times LOW,
provided that each $\alpha_i \le \alpha$.
Since LOW is a lower bound on the staleness achieved by
any algorithm, even the optimal, prescient one, and penalty
is an upper bound on the staleness achieved by any eager
algorithm, the corollary implies the claimed competitiveness.
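The threshold $(\sqrt{3} - 1)/2$ in the corollary can be traced back to the condition of Theorem 3.1 with a short calculation. The sketch below is an illustration, not part of the original proof. It writes $\beta$ and $\delta$ for the two parameters of the theorem and $\alpha$ for the quantity chosen there, reads the condition as an inequality, assumes $\alpha < p/t$, and introduces the shorthands $q = p/t$ and $C = C_{\beta,\delta}$.

```latex
\begin{align*}
\frac{\alpha}{1 - \alpha/q} \le C q
  &\iff \alpha \le C q - C \alpha \\
  &\iff \alpha \le \frac{C}{1 + C}\, q . \\[4pt]
\sup_{0 < \beta, \delta < 1} C_{\beta,\delta}
  &= \frac{1}{\sqrt{3}},
  \qquad
  \frac{1/\sqrt{3}}{1 + 1/\sqrt{3}}
  = \frac{1}{\sqrt{3} + 1}
  = \frac{\sqrt{3} - 1}{2} .
\end{align*}
```

Since $C/(1 + C)$ increases with $C$, every bound strictly below $(\sqrt{3} - 1)/2$ times $p/t$ is attained by some admissible choice of $\beta$ and $\delta$, which is consistent with the range stated in the corollary.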
Proof: Let $B$ be the set of batches in this run. For some
batch $B_i \in B$, let $c_i$ be the length of the first update,
$d_i$ be the wait time, and $b_i$ be the total length of the
batch, i.e., the sum of the lengths of its updates. Clearly,
$c_i \le b_i \le c_i + d_i$, (1)
since $c_i \le b_i$ is obvious and since $c_i + d_i$ is the
duration in time from the start (not release) time of the first
job in the batch till the update for the batch starts, and this
duration is clearly at least the length $b_i$ of the batch. For
the penalty of this batch, denoted by $\rho_i$, we take the
square of the sum of the length $c_i$ of the first update, the
wait time $d_i$, and the processing time of the entire batch:
$\rho_i = [(c_i + d_i) + \alpha b_i]^2$, by the definition of penalty (2)
$\le (1 + \alpha)^2 (c_i + d_i)^2$, by (1). (3)
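The step from (2) to (3) uses only the right-hand inequality of (1). Spelled out as a sketch, writing $\rho_i$ for the penalty of batch $B_i$ and $\alpha$ for the processing-time factor of the theorem (the intermediate line is added here for illustration):

```latex
\begin{align*}
\rho_i &= \bigl[(c_i + d_i) + \alpha b_i\bigr]^2 \\
       &\le \bigl[(c_i + d_i) + \alpha (c_i + d_i)\bigr]^2
         && \text{since } b_i \le c_i + d_i \text{ by (1)} \\
       &= (1 + \alpha)^2 (c_i + d_i)^2 .
\end{align*}
```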
Figure 2 illustrates the quantities $b_i$, $c_i$ and $d_i$ using
the same example as that in Figure 1; in particular, we consider
the batch consisting of updates arriving at times $r_{i,1}$ and
$r_{i,2}$.
Let $A$ be the set of all updates. From the definition of
LOW, each update $i \in A$ has a budget of $a_i^2$ units, where
$a_i$ is the length of update $i$. Our proof requires the use of
a charging scheme. A charging scheme specifies what
fraction of its budget each update pays to a certain batch.
Let us call a batch $B_i$ tardy if $c_i < \beta (c_i + d_i)$
(where $\beta$ comes from Theorem 3.1); otherwise it is punctual.
Let us denote the corresponding sets by $B_t$ and $B_p$
respectively. More formally, a charging scheme is a matrix
$(v_{ij})$ of nonnegative values, where $v_{ij}$ shows the extent
of dependence of batch $i$ on the budget available to batch $j$,
with the following two properties.
Figure 2: A plot of the staleness of table i over time
4. Conclusion
In this paper, we studied the complexity of scheduling
data-loading jobs to minimize the staleness of a real time
stream warehouse. We proved that any online nonpreemptive
algorithm that is never voluntarily idle achieves a constant
competitive ratio with respect to the total staleness of all
tables in the warehouse, provided that the processors are
sufficiently fast.
We addressed the problem of scheduling updates in a real-time
streaming warehouse. We proposed the notion of average
staleness as a scheduling metric and presented scheduling
algorithms designed to handle the complex environment of a
streaming data warehouse. We then proposed a scheduling
framework that assigns jobs to processing tracks and also
uses the basic algorithms to schedule jobs within a track.
The main feature of our framework is the ability to reserve
resources for short jobs that often correspond to important,
frequently refreshed tables while avoiding the inefficiencies
associated with partitioned scheduling techniques. Future
work is needed on choosing the right scheduling
granularity when it is more efficient to update multiple
tables together.
References
[1] L. Golab, T. Johnson, J. S. Seidel, and V. Shkapenyuk,
Stream Warehousing with Data Depot, SIGMOD 2009,
847-854.
[2] L. Golab, T. Johnson, and V. Shkapenyuk, Scheduling
Updates in a Real Time Stream Warehouse, ICDE 2009,
1207-1210.
[3] W. Labio, R. Yerneni, and H. Garcia-Molina, Shrinking
the Warehouse Update Window, SIGMOD 1999, 383-394.
[4] N. Polyzotis, S. Skiadopoulos, P. Vassiliadis, A. Simitsis,
and N.-E. Frantzell, Supporting Streaming Updates in
an Active Data Warehouse, ICDE 2007, 476-485.
[5] L. Golab, T. Johnson, and V. Shkapenyuk, Scalable
Scheduling of Updates in Streaming Data Warehouses,
AT&T Labs – Research, Florham Park, NJ, 0793.
[6] N. Folkert, A. Gupta, A. Witkowski, S. Subramanian,
S. Bellamkonda, S. Shankar, T. Bozkaya, and L. Sheng,
Optimizing Refresh of a Set of Materialized Views,
VLDB 2005, 1043-1054.
[7] L. Golab, T. Johnson, J. S. Seidel, and V. Shkapenyuk,
Stream Warehousing with Data Depot, SIGMOD 2009,
847-854.
Paper ID: 21091302 377

More Related Content

PDF
[IJET V2I2P18] Authors: Roopa G Yeklaspur, Dr.Yerriswamy.T
IJET - International Journal of Engineering and Techniques
 
PDF
Scalable scheduling of updates in streaming data warehouses
IRJET Journal
 
PDF
Presentation southernstork 2009-nov-southernworkshop
balmanme
 
PDF
Fault tolerance on cloud computing
www.pixelsolutionbd.com
 
PDF
IRJET- Enhance Dynamic Heterogeneous Shortest Job first (DHSJF): A Task Schedu...
IRJET Journal
 
PDF
Survey of streaming data warehouse update scheduling
eSAT Journals
 
PDF
Bounded ant colony algorithm for task Allocation on a network of homogeneous ...
ijcsit
 
PDF
SummerStudentReport-HamzaZafar
Hamza Zafar
 
[IJET V2I2P18] Authors: Roopa G Yeklaspur, Dr.Yerriswamy.T
IJET - International Journal of Engineering and Techniques
 
Scalable scheduling of updates in streaming data warehouses
IRJET Journal
 
Presentation southernstork 2009-nov-southernworkshop
balmanme
 
Fault tolerance on cloud computing
www.pixelsolutionbd.com
 
IRJET- Enhance Dynamic Heterogeneous Shortest Job first (DHSJF): A Task Schedu...
IRJET Journal
 
Survey of streaming data warehouse update scheduling
eSAT Journals
 
Bounded ant colony algorithm for task Allocation on a network of homogeneous ...
ijcsit
 
SummerStudentReport-HamzaZafar
Hamza Zafar
 

What's hot (19)

PPTX
Distributed System Management
Ibrahim Amer
 
PDF
An Adaptive Load Sharing Algorithm for Heterogeneous Distributed System
IJORCS
 
PDF
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
ijceronline
 
PDF
Balman dissertation Copyright @ 2010 Mehmet Balman
balmanme
 
PDF
E01113138
IOSR Journals
 
PDF
Efficient Resource Management Mechanism with Fault Tolerant Model for Computa...
Editor IJCATR
 
PDF
Enhancing Performance and Fault Tolerance of Hadoop Cluster
IRJET Journal
 
PDF
Comparative Analysis of Various Grid Based Scheduling Algorithms
iosrjce
 
PDF
Continental division of load and balanced ant
IJCI JOURNAL
 
PDF
Job Resource Ratio Based Priority Driven Scheduling in Cloud Computing
ijsrd.com
 
PDF
A Heterogeneous Static Hierarchical Expected Completion Time Based Scheduling...
IRJET Journal
 
PDF
Chapter 4: Parallel Programming Languages
Heman Pathak
 
PPTX
Optimization of Continuous Queries in Federated Database and Stream Processin...
Zbigniew Jerzak
 
PDF
J0210053057
researchinventy
 
PDF
A survey of various scheduling algorithm in cloud computing environment
eSAT Publishing House
 
PDF
A study of load distribution algorithms in distributed scheduling
eSAT Publishing House
 
PDF
Genetic Algorithm for Process Scheduling
Login Technoligies
 
Distributed System Management
Ibrahim Amer
 
An Adaptive Load Sharing Algorithm for Heterogeneous Distributed System
IJORCS
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
ijceronline
 
Balman dissertation Copyright @ 2010 Mehmet Balman
balmanme
 
E01113138
IOSR Journals
 
Efficient Resource Management Mechanism with Fault Tolerant Model for Computa...
Editor IJCATR
 
Enhancing Performance and Fault Tolerance of Hadoop Cluster
IRJET Journal
 
Comparative Analysis of Various Grid Based Scheduling Algorithms
iosrjce
 
Continental division of load and balanced ant
IJCI JOURNAL
 
Job Resource Ratio Based Priority Driven Scheduling in Cloud Computing
ijsrd.com
 
A Heterogeneous Static Hierarchical Expected Completion Time Based Scheduling...
IRJET Journal
 
Chapter 4: Parallel Programming Languages
Heman Pathak
 
Optimization of Continuous Queries in Federated Database and Stream Processin...
Zbigniew Jerzak
 
J0210053057
researchinventy
 
A survey of various scheduling algorithm in cloud computing environment
eSAT Publishing House
 
A study of load distribution algorithms in distributed scheduling
eSAT Publishing House
 
Genetic Algorithm for Process Scheduling
Login Technoligies
 
Ad

Viewers also liked (7)

PDF
Lake Water Environment Capacity Analysis Based on Steady-State Model
International Journal of Science and Research (IJSR)
 
PDF
Impacts of Agricultural Activities on Water Quality in the Dufuya Dambos, Low...
International Journal of Science and Research (IJSR)
 
PDF
Design of Remote Video Monitoring and Motion Detection System based on Arm-Li...
International Journal of Science and Research (IJSR)
 
PDF
Detection of Cysts in Ultrasonic Images of Ovary
International Journal of Science and Research (IJSR)
 
PDF
Providing Accident Detection in Vehicular Networks through OBD-II Devices and...
International Journal of Science and Research (IJSR)
 
PDF
Voice Morphing System for People Suffering from Laryngectomy
International Journal of Science and Research (IJSR)
 
PDF
Radiochemical Properties of Irradiated PVA\AgNO3 Film by Electron Beam
International Journal of Science and Research (IJSR)
 
Lake Water Environment Capacity Analysis Based on Steady-State Model
International Journal of Science and Research (IJSR)
 
Impacts of Agricultural Activities on Water Quality in the Dufuya Dambos, Low...
International Journal of Science and Research (IJSR)
 
Design of Remote Video Monitoring and Motion Detection System based on Arm-Li...
International Journal of Science and Research (IJSR)
 
Detection of Cysts in Ultrasonic Images of Ovary
International Journal of Science and Research (IJSR)
 
Providing Accident Detection in Vehicular Networks through OBD-II Devices and...
International Journal of Science and Research (IJSR)
 
Voice Morphing System for People Suffering from Laryngectomy
International Journal of Science and Research (IJSR)
 
Radiochemical Properties of Irradiated PVA\AgNO3 Film by Electron Beam
International Journal of Science and Research (IJSR)
 
Ad

Similar to Minimize Staleness and Stretch in Streaming Data Warehouses (20)

PPT
REAL TIME PROJECTS IEEE BASED PROJECTS EMBEDDED SYSTEMS PAPER PUBLICATIONS M...
Finalyear Projects
 
PPT
Scalable scheduling of updates in streaming data warehouses
Finalyear Projects
 
PDF
High Dimensionality Structures Selection for Efficient Economic Big data usin...
IRJET Journal
 
PDF
capacityshifting1
Gokul Vasan
 
PDF
Updating and Scheduling of Streaming Web Services in Data Warehouses
International Journal of Science and Research (IJSR)
 
PDF
A case study on Machine scheduling and sequencing using Meta heuristics
IJERA Editor
 
PDF
A case study on Machine scheduling and sequencing using Meta heuristics
IJERA Editor
 
PPTX
Scheduling algorithm in real time system
VishalPandat2
 
PDF
International Journal of Engineering and Science Invention (IJESI)
inventionjournals
 
PDF
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...
confluent
 
PDF
Problems in Task Scheduling in Multiprocessor System
ijtsrd
 
PDF
A survey on the performance of job scheduling in workflow application
iaemedu
 
PDF
Distributed Feature Selection for Efficient Economic Big Data Analysis
IRJET Journal
 
PDF
Real Time most famous algorithms
Andrea Tino
 
PPTX
Data Streaming (in a Nutshell) ... and Spark's window operations
Vincenzo Gulisano
 
PDF
IRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET Journal
 
PDF
Design And Analysis Of Algorithms Lecture Notes Mit 6046j Itebooks
arkosirubek44
 
PDF
SCHEDULING DIFFERENT CUSTOMER ACTIVITIES WITH SENSING DEVICE
ijait
 
PDF
Efficient Cost Minimization for Big Data Processing
IRJET Journal
 
PDF
PV 2014 - Montali - Verification of Parameterized Data-Aware Dynamic Systems
Faculty of Computer Science - Free University of Bozen-Bolzano
 
REAL TIME PROJECTS IEEE BASED PROJECTS EMBEDDED SYSTEMS PAPER PUBLICATIONS M...
Finalyear Projects
 
Scalable scheduling of updates in streaming data warehouses
Finalyear Projects
 
High Dimensionality Structures Selection for Efficient Economic Big data usin...
IRJET Journal
 
capacityshifting1
Gokul Vasan
 
Updating and Scheduling of Streaming Web Services in Data Warehouses
International Journal of Science and Research (IJSR)
 
A case study on Machine scheduling and sequencing using Meta heuristics
IJERA Editor
 
A case study on Machine scheduling and sequencing using Meta heuristics
IJERA Editor
 
Scheduling algorithm in real time system
VishalPandat2
 
International Journal of Engineering and Science Invention (IJESI)
inventionjournals
 
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...
confluent
 
Problems in Task Scheduling in Multiprocessor System
ijtsrd
 
A survey on the performance of job scheduling in workflow application
iaemedu
 
Distributed Feature Selection for Efficient Economic Big Data Analysis
IRJET Journal
 
Real Time most famous algorithms
Andrea Tino
 
Data Streaming (in a Nutshell) ... and Spark's window operations
Vincenzo Gulisano
 
IRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET Journal
 
Design And Analysis Of Algorithms Lecture Notes Mit 6046j Itebooks
arkosirubek44
 
SCHEDULING DIFFERENT CUSTOMER ACTIVITIES WITH SENSING DEVICE
ijait
 
Efficient Cost Minimization for Big Data Processing
IRJET Journal
 
PV 2014 - Montali - Verification of Parameterized Data-Aware Dynamic Systems
Faculty of Computer Science - Free University of Bozen-Bolzano
 

More from International Journal of Science and Research (IJSR) (20)

PDF
Innovations in the Diagnosis and Treatment of Chronic Heart Failure
International Journal of Science and Research (IJSR)
 
PDF
Design and implementation of carrier based sinusoidal pwm (bipolar) inverter
International Journal of Science and Research (IJSR)
 
PDF
Polarization effect of antireflection coating for soi material system
International Journal of Science and Research (IJSR)
 
PDF
Image resolution enhancement via multi surface fitting
International Journal of Science and Research (IJSR)
 
PDF
Ad hoc networks technical issues on radio links security &amp; qo s
International Journal of Science and Research (IJSR)
 
PDF
Microstructure analysis of the carbon nano tubes aluminum composite with diff...
International Journal of Science and Research (IJSR)
 
PDF
Improving the life of lm13 using stainless spray ii coating for engine applic...
International Journal of Science and Research (IJSR)
 
PDF
An overview on development of aluminium metal matrix composites with hybrid r...
International Journal of Science and Research (IJSR)
 
PDF
Pesticide mineralization in water using silver nanoparticles incorporated on ...
International Journal of Science and Research (IJSR)
 
PDF
Comparative study on computers operated by eyes and brain
International Journal of Science and Research (IJSR)
 
PDF
T s eliot and the concept of literary tradition and the importance of allusions
International Journal of Science and Research (IJSR)
 
PDF
Effect of select yogasanas and pranayama practices on selected physiological ...
International Journal of Science and Research (IJSR)
 
PDF
Grid computing for load balancing strategies
International Journal of Science and Research (IJSR)
 
PDF
A new algorithm to improve the sharing of bandwidth
International Journal of Science and Research (IJSR)
 
PDF
Main physical causes of climate change and global warming a general overview
International Journal of Science and Research (IJSR)
 
PDF
Performance assessment of control loops
International Journal of Science and Research (IJSR)
 
PDF
Capital market in bangladesh an overview
International Journal of Science and Research (IJSR)
 
PDF
Faster and resourceful multi core web crawling
International Journal of Science and Research (IJSR)
 
PDF
Extended fuzzy c means clustering algorithm in segmentation of noisy images
International Journal of Science and Research (IJSR)
 
PDF
Parallel generators of pseudo random numbers with control of calculation errors
International Journal of Science and Research (IJSR)
 
Innovations in the Diagnosis and Treatment of Chronic Heart Failure
International Journal of Science and Research (IJSR)
 
Design and implementation of carrier based sinusoidal pwm (bipolar) inverter
International Journal of Science and Research (IJSR)
 
Polarization effect of antireflection coating for soi material system
International Journal of Science and Research (IJSR)
 
Image resolution enhancement via multi surface fitting
International Journal of Science and Research (IJSR)
 
Ad hoc networks technical issues on radio links security &amp; qo s
International Journal of Science and Research (IJSR)
 
Microstructure analysis of the carbon nano tubes aluminum composite with diff...
International Journal of Science and Research (IJSR)
 
Improving the life of lm13 using stainless spray ii coating for engine applic...
International Journal of Science and Research (IJSR)
 
An overview on development of aluminium metal matrix composites with hybrid r...
International Journal of Science and Research (IJSR)
 
Pesticide mineralization in water using silver nanoparticles incorporated on ...
International Journal of Science and Research (IJSR)
 
Comparative study on computers operated by eyes and brain
International Journal of Science and Research (IJSR)
 
T s eliot and the concept of literary tradition and the importance of allusions
International Journal of Science and Research (IJSR)
 
Effect of select yogasanas and pranayama practices on selected physiological ...
International Journal of Science and Research (IJSR)
 
Grid computing for load balancing strategies
International Journal of Science and Research (IJSR)
 
A new algorithm to improve the sharing of bandwidth
International Journal of Science and Research (IJSR)
 
Main physical causes of climate change and global warming a general overview
International Journal of Science and Research (IJSR)
 
Performance assessment of control loops
International Journal of Science and Research (IJSR)
 
Capital market in bangladesh an overview
International Journal of Science and Research (IJSR)
 
Faster and resourceful multi core web crawling
International Journal of Science and Research (IJSR)
 
Extended fuzzy c means clustering algorithm in segmentation of noisy images
International Journal of Science and Research (IJSR)
 
Parallel generators of pseudo random numbers with control of calculation errors
International Journal of Science and Research (IJSR)
 

Recently uploaded (20)

PDF
2.Reshaping-Indias-Political-Map.ppt/pdf/8th class social science Exploring S...
Sandeep Swamy
 
PPTX
Kanban Cards _ Mass Action in Odoo 18.2 - Odoo Slides
Celine George
 
PDF
BÀI TẬP TEST BỔ TRỢ THEO TỪNG CHỦ ĐỀ CỦA TỪNG UNIT KÈM BÀI TẬP NGHE - TIẾNG A...
Nguyen Thanh Tu Collection
 
DOCX
Action Plan_ARAL PROGRAM_ STAND ALONE SHS.docx
Levenmartlacuna1
 
PDF
The Minister of Tourism, Culture and Creative Arts, Abla Dzifa Gomashie has e...
nservice241
 
PPTX
Artificial-Intelligence-in-Drug-Discovery by R D Jawarkar.pptx
Rahul Jawarkar
 
PPTX
TEF & EA Bsc Nursing 5th sem.....BBBpptx
AneetaSharma15
 
PPTX
A Smarter Way to Think About Choosing a College
Cyndy McDonald
 
PPTX
Five Point Someone – Chetan Bhagat | Book Summary & Analysis by Bhupesh Kushwaha
Bhupesh Kushwaha
 
PPTX
Introduction to pediatric nursing in 5th Sem..pptx
AneetaSharma15
 
PPTX
How to Apply for a Job From Odoo 18 Website
Celine George
 
PDF
Review of Related Literature & Studies.pdf
Thelma Villaflores
 
PDF
Virat Kohli- the Pride of Indian cricket
kushpar147
 
PPTX
CARE OF UNCONSCIOUS PATIENTS .pptx
AneetaSharma15
 
PPTX
How to Track Skills & Contracts Using Odoo 18 Employee
Celine George
 
PPTX
An introduction to Prepositions for beginners.pptx
drsiddhantnagine
 
PPTX
20250924 Navigating the Future: How to tell the difference between an emergen...
McGuinness Institute
 
PPTX
Tips Management in Odoo 18 POS - Odoo Slides
Celine George
 
PDF
Module 2: Public Health History [Tutorial Slides]
JonathanHallett4
 
DOCX
SAROCES Action-Plan FOR ARAL PROGRAM IN DEPED
Levenmartlacuna1
 
2.Reshaping-Indias-Political-Map.ppt/pdf/8th class social science Exploring S...
Sandeep Swamy
 
Kanban Cards _ Mass Action in Odoo 18.2 - Odoo Slides
Celine George
 
BÀI TẬP TEST BỔ TRỢ THEO TỪNG CHỦ ĐỀ CỦA TỪNG UNIT KÈM BÀI TẬP NGHE - TIẾNG A...
Nguyen Thanh Tu Collection
 
Action Plan_ARAL PROGRAM_ STAND ALONE SHS.docx
Levenmartlacuna1
 
The Minister of Tourism, Culture and Creative Arts, Abla Dzifa Gomashie has e...
nservice241
 
Artificial-Intelligence-in-Drug-Discovery by R D Jawarkar.pptx
Rahul Jawarkar
 
TEF & EA Bsc Nursing 5th sem.....BBBpptx
AneetaSharma15
 
A Smarter Way to Think About Choosing a College
Cyndy McDonald
 
Five Point Someone – Chetan Bhagat | Book Summary & Analysis by Bhupesh Kushwaha
Bhupesh Kushwaha
 
Introduction to pediatric nursing in 5th Sem..pptx
AneetaSharma15
 
How to Apply for a Job From Odoo 18 Website
Celine George
 
Review of Related Literature & Studies.pdf
Thelma Villaflores
 
Virat Kohli- the Pride of Indian cricket
kushpar147
 
CARE OF UNCONSCIOUS PATIENTS .pptx
AneetaSharma15
 
How to Track Skills & Contracts Using Odoo 18 Employee
Celine George
 
An introduction to Prepositions for beginners.pptx
drsiddhantnagine
 
20250924 Navigating the Future: How to tell the difference between an emergen...
McGuinness Institute
 
Tips Management in Odoo 18 POS - Odoo Slides
Celine George
 
Module 2: Public Health History [Tutorial Slides]
JonathanHallett4
 
SAROCES Action-Plan FOR ARAL PROGRAM IN DEPED
Levenmartlacuna1
 

Minimize Staleness and Stretch in Streaming Data Warehouses

  • 1. International Journal of Science and Research (IJSR), India Online ISSN: 2319-7064 Volume 2 Issue 9, September 2013 www.ijsr.net Minimize Staleness and Stretch in Streaming Data Warehouses S. M. Subhani1 , M. Nagendramma2 1, 2 Department of CSE, BVSREC, Chimakurthy, A.P, India Abstract: We study scheduling algorithms for loading data feeds into real time data warehouses, which are used in applications such as IP network monitoring, online financial trading, and credit card fraud detection. In these applications, the warehouse collects a large number of streaming data feeds that are generated by external sources and arrive asynchronously. We discuss update scheduling in streaming data warehouses, which combine the features of traditional data warehouses and data stream systems. In our setting, external sources push append-only data streams into the warehouse with a wide range of inter-arrival times. While traditional data warehouses are typically refreshed during downtimes, streaming warehouses are updated as new data arrive. In this paper we develop a theory of temporal consistency for stream warehouses that allows for multiple consistency levels. We model the streaming warehouse update problem as a scheduling problem, where jobs correspond to processes that load new data into tables, and whose objective is to minimize data staleness over time. Keywords: Data warehouse maintenance, online scheduling 1. Introduction The goal of a streaming warehouse is to propagate new data across all the relevant tables and views as quickly as possible. Once new data are loaded, the applications and triggers defined on the warehouse can take immediate action. This allows businesses to make decisions in nearly real time, which may lead to increased profits, improved customer satisfaction, and prevention of serious problems that could develop if no action was taken. Data warehouses integrate information from multiple operational databases to enable complex business analyses. 
In traditional applications, warehouses are updated periodically and data analysis is done off-line [3]. In contrast, real time warehouses [1], also known as active warehouses [4], continually load incoming data feeds to support time-critical analyses. For instance, an Internet Service Provider (ISP) may collect streams of network configuration and performance data generated by remote sources in nearly real time. New data must be loaded in a timely manner and correlated against historical data to quickly identify network anomalies, denial-of-service attacks, and inconsistencies among protocol layers. Similarly, online stock trading applications may discover profit opportunities by comparing recent transactions in nearly real time against historical trends. Banks may be interested in analyzing incoming streams of credit card transactions to protect customers against identity theft.

Since the effectiveness of a real time warehouse depends on its ability to ingest new data, we study problems related to data staleness. In our setting, each table in the warehouse collects data from an external source. The arrival of a set of new data releases an update that seeks to append the data to the corresponding table. Since existing data are not modified, the processing time of an update is at most proportional to the amount of new data. Our first objective is to schedule the updates nonpreemptively on one or more processors in a way that minimizes the total staleness of all tables.

Our first contribution answers a question implicit in [2] regarding the difficulty of this problem. We prove that, even in the purely online model, any online nonpreemptive algorithm achieves staleness at most a constant factor times optimal, provided that no processor is ever voluntarily idle and that the processors are sufficiently fast.

2. System Model

2.1 Warehousing Architecture

Figure 1 illustrates a streaming data warehouse.
Each data stream is generated by an external source, with a batch of new data, consisting of one or more records, being pushed to the warehouse with period $p_i$. If the period of a stream is unknown or unpredictable, we let the user choose a period with which the warehouse should check for new data. Examples of streams collected by an Internet Service Provider include router performance statistics such as CPU usage, system logs, routing table updates, and link layer alerts. An important property of the data streams in our motivating applications is that they are append-only, i.e., existing records are never modified or deleted. For example, a stream of average router CPU utilization measurements may consist of records with fields (timestamp, router name, CPU utilization), and a new data file with updated CPU measurements for each router may arrive at the warehouse every five minutes [5].

Figure 1: Stream data warehouse

A streaming data warehouse maintains two types of tables: base and derived. Each table may be stored partially or wholly on disk.

Paper ID: 21091302

  • 2. A base table is loaded directly from a data stream. A derived table is a materialized view defined over one or more tables. Each base or derived table $T_j$ has a user-defined priority $p_j$ and a time-dependent staleness function $s_j(\tau)$ that will be defined shortly. Relationships among source and derived tables form a (directed and acyclic) dependency graph. For each table $T_j$, we define a set of its ancestor tables as those which directly or indirectly serve as its sources, and a set of its dependent tables as those which are directly or indirectly sourced from $T_j$. For example, $T_1$, $T_2$ and $T_3$ are ancestors of $T_4$, and $T_3$ and $T_4$ are dependents of $T_1$. In practice, warehouse tables are horizontally partitioned by time so that only a small number of recent partitions are affected by updates [6][7].

2.2 Earliest Deadline First (EDF)

EDF has been proven to be an optimal uniprocessor scheduling algorithm. This means that if a set of tasks is unschedulable under EDF, then no other scheduling algorithm can feasibly schedule this task set. At each instant, the EDF algorithm chooses for execution the currently active job(s) with the nearest deadlines. On uniform parallel machines, EDF is implemented according to the following rules: no processor is idled while there are active jobs waiting for execution; when fewer than m jobs are active, they are required to execute on the fastest processors while the slowest are idled; and higher-priority jobs are executed on faster processors. In Earliest Deadline First scheduling, at every scheduling point the task with the earliest deadline is taken up for scheduling. On a single processor with preemption and deadlines equal to periods, a periodic task set is schedulable under EDF if and only if its total processor utilization $U = \sum_i U_i$ is at most 1.
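The EDF rule described in Section 2.2 can be illustrated with a minimal sketch. It assumes a single processor and non-preemptive execution (chosen here to match the update-loading setting of Section 1, not stated in Section 2.2 itself); the names `Job`, `edf_schedule`, and `total_utilization` are hypothetical and introduced only for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    release: float   # time at which the job becomes active
    length: float    # processing time
    deadline: float  # absolute deadline


def edf_schedule(jobs):
    """Return [(name, start, finish)] for non-preemptive EDF on one processor.

    Assumption: once started, a job runs to completion (no preemption).
    """
    pending = sorted(jobs, key=lambda j: j.release)
    ready = []      # min-heap keyed by (deadline, arrival index)
    schedule = []
    now = 0.0
    i = 0
    while i < len(pending) or ready:
        # Move every job released by `now` into the ready heap.
        while i < len(pending) and pending[i].release <= now:
            heapq.heappush(ready, (pending[i].deadline, i, pending[i]))
            i += 1
        if not ready:
            # No active job: idle until the next release.
            now = pending[i].release
            continue
        # Pick the active job with the nearest deadline.
        _, _, job = heapq.heappop(ready)
        start = now
        now = start + job.length
        schedule.append((job.name, start, now))
    return schedule


def total_utilization(tasks):
    """Sum of execution_time / period over (execution_time, period) pairs."""
    return sum(c / p for c, p in tasks)


if __name__ == "__main__":
    demo = [Job("A", 0, 2, 10), Job("B", 0, 1, 3), Job("C", 1, 1, 2.5)]
    for entry in edf_schedule(demo):
        print(entry)
    print("utilization:", total_utilization([(1, 4), (2, 8)]))
```

The heap key includes the arrival index so that ties on deadline are broken deterministically without comparing `Job` objects. The utilization helper only computes the sum; whether that sum guarantees schedulability depends on the task model (preemption, deadlines equal to periods), which this sketch does not check.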
The aim of this work is to provide a sensitivity analysis for task deadlines in the context of a multiprocessor system by using a new approach, the EFDF (Earliest Feasible Deadline First) algorithm. In order to decrease the number of migrations, we prevent a job from moving from one processor to another if it is among the m higher-priority jobs. Therefore, a job will continue its execution on the same processor if possible (processor affinity). The result of these comparisons outlines some situations where one scheme is preferable over the other. Partitioning schemes are better suited for hard real-time systems, while a global scheme is preferable for soft real-time systems. The final EDF-partitioned scheduling algorithm is as follows:

1. Sort the released jobs by the local algorithm
2. For each job $j_i$ in sorted order
   a) If $j_i$'s home track is available, schedule $j_i$ on its home track
   b) Else, if there is an available free track, schedule $j_i$ on the free track
   c) Else, scan through the tracks r such that $j_i$ can be promoted to track r
      i) If track r is free and there is no released job remaining in the sorted list for home track r, schedule $j_i$ on track r
   d) Else, delay the execution of $j_i$

3. Minimizing Staleness

We call an algorithm eager, or work-conserving, if it leaves no processor idle while at least one pending update exists. We first state the rather inscrutable Theorem 3.1, followed by an easy-to-read corollary, which implies that for any $C < (\sqrt{3} - 1)/2$, there is a constant (dependent on $C$) such that the staleness of any eager algorithm is at most that constant factor times optimal, provided that each $a_i$ is at most $Cp/t$.

Theorem 3.1. Fix $p$, $t$. For any $\epsilon$ and $\delta$ such that $0 < \epsilon, \delta < 1$, define $C_{\epsilon,\delta} = \sqrt{\delta}\,(1 - \epsilon)/\sqrt{3} > 0$. Given $p$ processors and $t$ tables, pick any $\alpha$ such that $\alpha / [1 - \alpha/(p/t)] \le C_{\epsilon,\delta}\, p/t$. Then the penalty incurred by an eager algorithm is at most $(1 + \alpha)^2 (1/\epsilon^4)(1/(1 - \delta))$ times LOW, provided that each $a_i \le \alpha$.
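As a consistency check on the constant $(\sqrt{3} - 1)/2$ quoted before the theorem, here is a short worked derivation. It rests on two assumptions: that the theorem's condition on $\alpha$ reads $\alpha / [1 - \alpha/(p/t)] \le C_{\epsilon,\delta}\, p/t$ with $C_{\epsilon,\delta} = \sqrt{\delta}\,(1 - \epsilon)/\sqrt{3}$, and that the corollary is obtained by setting $\alpha = Cp/t$.

```latex
\begin{align*}
\alpha = \frac{Cp}{t}
  \;&\Longrightarrow\;
  \frac{\alpha}{1 - \alpha/(p/t)} = \frac{C}{1 - C}\cdot\frac{p}{t}, \\
\frac{C}{1 - C}\cdot\frac{p}{t} \le C_{\epsilon,\delta}\,\frac{p}{t}
  \;&\Longleftrightarrow\;
  C \le \frac{C_{\epsilon,\delta}}{1 + C_{\epsilon,\delta}}
  \qquad (0 < C < 1), \\
\sup_{0 < \epsilon,\delta < 1} C_{\epsilon,\delta} = \frac{1}{\sqrt{3}}
  \;&\Longrightarrow\;
  \sup_{0 < \epsilon,\delta < 1} \frac{C_{\epsilon,\delta}}{1 + C_{\epsilon,\delta}}
  = \frac{1/\sqrt{3}}{1 + 1/\sqrt{3}}
  = \frac{1}{\sqrt{3} + 1}
  = \frac{\sqrt{3} - 1}{2}.
\end{align*}
```

Since $x/(1+x)$ is increasing and the supremum over $\epsilon, \delta$ is not attained, the resulting bound on $C$ is strict: every $C < (\sqrt{3} - 1)/2 \approx 0.366$ is admissible for some choice of $\epsilon$ and $\delta$, which agrees with the range stated for the corollary.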
Since LOW is a lower bound on the staleness achieved by any algorithm, even the optimal, prescient one, and penalty is an upper bound on the staleness achieved by any eager algorithm, the corollary implies the claimed competitiveness.

Proof: Let $B$ be the set of batches in this run. For some batch $B_i \in B$, let $c_i$ be the length of the first update, $d_i$ be the wait time, and $b_i$ be the total length of the batch, i.e., the sum of the lengths of its updates. Clearly,

$c_i \le b_i \le c_i + d_i$, (1)

since $c_i \le b_i$ is obvious and since $c_i + d_i$ is the duration in time from the start (not release) time of the first job in the batch till the update for the batch starts, and this duration is clearly at least the length $b_i$ of the batch. For the penalty of this batch, denoted by $\rho_i$, we take the square of the own time, i.e., the length $c_i$ of the first update plus the wait time $d_i$ plus the processing time of the entire batch:

$\rho_i = [(c_i + d_i) + \alpha b_i]^2$, by the definition of penalty (2)
$\le (1 + \alpha)^2 (c_i + d_i)^2$, by (1). (3)

Figure 2 illustrates the quantities $b_i$, $c_i$ and $d_i$ using the same example as that in Figure 1; in particular, we consider the batch consisting of updates arriving at times $r_{i,1}$ and $r_{i,2}$.

Let $A$ be the set of all updates. From the definition of LOW, each update $i \in A$ has a budget of $a_i^2$ units, where $a_i$ is the length of update $i$. Our proof requires the use of a charging scheme. A charging scheme specifies what fraction of its budget each update pays to a certain batch. Let us call a batch $B_i$ tardy if $c_i < \epsilon (c_i + d_i)$ (where $\epsilon$ comes from Theorem 3.1); otherwise it is punctual. Let us denote the corresponding sets by $B_t$ and $B_p$ respectively. More formally, a charging scheme is a matrix $(v_{ij})$ of nonnegative values, where $v_{ij}$ shows the extent of dependence of batch $i$ on the budget available to batch $j$, with the following two properties.
  • 3. Figure 2: A plot of the staleness of table i over time

4. Conclusion

In this paper, we studied the complexity of scheduling data-loading jobs to minimize the staleness of a real time stream warehouse. We proved that any online nonpreemptive algorithm that is never voluntarily idle achieves a constant competitive ratio with respect to the total staleness of all tables in the warehouse, provided that the processors are sufficiently fast. We addressed the problem of scheduling updates in a real-time streaming warehouse. We proposed the notion of average staleness as a scheduling metric and presented scheduling algorithms designed to handle the complex environment of a streaming data warehouse. We then proposed a scheduling framework that assigns jobs to processing tracks and uses the basic algorithms to schedule jobs within a track. The main feature of our framework is the ability to reserve resources for short jobs, which often correspond to important, frequently refreshed tables, while avoiding the inefficiencies associated with partitioned scheduling techniques. Future work is needed on choosing the right scheduling granularity when it is more efficient to update multiple tables together.

References

[1] L. Golab, T. Johnson, J. S. Seidel, and V. Shkapenyuk, Stream Warehousing with Data Depot, SIGMOD 2009, 847-854.
[2] L. Golab, T. Johnson, and V. Shkapenyuk, Scheduling Updates in a Real Time Stream Warehouse, ICDE 2009, 1207-1210.
[3] W. Labio, R. Yerneni, and H. Garcia-Molina, Shrinking the Warehouse Update Window, SIGMOD 1999, 383-394.
[4] N. Polyzotis, S. Skiadopoulos, P. Vassiliadis, A. Simitsis, and N.-E. Frantzell, Supporting Streaming Updates in an Active Data Warehouse, ICDE 2007, 476-485.
[5] L. Golab, T. Johnson, and V. Shkapenyuk, Scalable Scheduling of Updates in Streaming Data Warehouses, AT&T Labs – Research, Florham Park, NJ, 0793.
[6] N. Folkert, A. Gupta, A. Witkowski, S. Subramanian, S. Bellamkonda, S. Shankar, T. Bozkaya, and L. Sheng, Optimizing Refresh of a Set of Materialized Views, VLDB 2005, 1043-1054.
[7] L. Golab, T. Johnson, J. S. Seidel, and V. Shkapenyuk, Stream Warehousing with Data Depot, SIGMOD 2009, 847-854.