A study of time-decayed aggregates computation on data streams

Thumbnail Image
Date
2009-01-01
Authors
Xu, Bojian
Major Professor
Advisor
Srikanta Tirthapura
Committee Member
Journal Title
Journal ISSN
Volume Title
Publisher
Altmetrics
Authors
Research Projects
Organizational Units
Journal Issue
Is Version Of
Versions
Series
Department
Electrical and Computer Engineering
Abstract

Many real world data naturally arrive as rapid paced and virtually unbounded streams. Examples of such streams include network traffic at a router, events observed by a sensor network, accesses to a web server and transactional updates to a large database. Such streaming data need to be monitored online to collect traffic statistics, detect trends and anomalies, tune system performance and help make business decisions. However, because of the large size and rapid pace of the data, as well as the online processing requirement, conventional data processing methods, such as storing the data in a database and issuing offline SQL queries thereafter, are not feasible. Data stream processing is a new diagram of massive data set processing and creates new challenges in the algorithm design and implementation.

In this thesis, we consider time-decayed data aggregation for data streams, where the importance or contribution of each data element decays over time, since recent data are usually considered of more importance in applications, and therefore are given heavier weights. We design small space data structures and algorithms for maintaining fundamental aggregates of the streams if it is possible and otherwise show large space lower bounds. We consider the data aggregation over a robust data stream model called asynchronous data stream, motivated by the streaming data transmitted in distributed systems, including computer networks, where the asynchrony in the data transmission is inevitable. In asynchronous data stream, the arrival order of the data elements at the receiver side is not necessarily the same as the order in which the data elements were generated. Asynchronous data stream is a robuster and generalized model of the previous synchronous data stream model.

In summary, this thesis presents the following results:

1. We formalize the model of asynchronous data stream and the notion of timestamp sliding window. We propose the first small space sketch for summarizing the data elements over timestamp sliding windows of multiple geographically distributed asynchronous data streams. The sketch can return accuracy guaranteed estimates for basic aggregates, such as: Sum, Median and Quantiles.

2. We design the first small space sketch for general purpose network streaming data aggregation. The sketch has the following properties that make it useful in communication-efficient aggregation in distributed streaming scenarios: (1) The sketch can handle multiple geographically distributed asynchronous data streams. (2) The sketch is duplicate-insensitive, i.e. reinsertions of the same data will not affect the sketch, and hence the estimates of aggregates. (3) The sketch is also time-decaying, so that the weight of each data element summarized in the sketch decreases over time. (4) The sketch returns accuracy guaranteed estimates for a variety of core aggregates, including the sum, median, quantiles, frequent elements and

selectivity.

3. We conduct a comprehensive study on the time-decayed correlated data aggregation over asynchronous data streams. For each class of time decay function, we either propose space efficient algorithms or show large space lower bounds. We not only closes the open problem of correlated data aggregation under sliding windows decay, but also presents negative results for the case of exponential decay, which however is highly used in the non-correlated scenarios.

4. We propose the forward decay model to simplify the time-decayed data stream aggregation and sampling. Forward decay captures a variety of usual decay functions (or called backward decay), such as exponential decay. We design efficient algorithms for data aggregation and sampling under the forward decay model, and show that they are easy to implement scalably.

Comments
Description
Keywords
Citation
Source
Copyright
Thu Jan 01 00:00:00 UTC 2009