Identifying correlated heavy-hitters in a two-dimensional data stream

Lahiri, Bibudh; Mukherjee, Arko; Tirthapura, Srikanta

File

2016_Tirthapura_IdentifyingCorrelated.pdf (228.47 KB)

Date

2016-07-01

Authors

Lahiri, Bibudh

Mukherjee, Arko

Tirthapura, Srikanta

Authors

Person

Tirthapura, Srikanta

Professor

Organizational Units

Organizational Unit

Electrical and Computer Engineering

The Department of Electrical and Computer Engineering (ECpE) has two areas of focus. The Electrical Engineering focus teaches students in fields such as control systems, electromagnetics and non-destructive evaluation, microelectronics, and electric power and energy systems. The Computer Engineering focus teaches students in fields such as software systems, embedded systems, networking, information security, and computer architecture.

History
The Department of Electrical Engineering was formed in 1909 when the Department of Physics and Electrical Engineering was divided. In 1985 its name changed to the Department of Electrical Engineering and Computer Engineering. In 1995 it became the Department of Electrical and Computer Engineering.

Dates of Existence
1909-present

Historical Names

Department of Electrical Engineering (1909-1985)
Department of Electrical Engineering and Computer Engineering (1985-1995)

Related Units

College of Engineering (parent college)
Department of Physics and Electrical Engineering (predecessor)

Abstract

We consider online mining of correlated heavy-hitters (CHH) from a data stream. Given a stream of two-dimensional data, a correlated aggregate query first extracts a substream by applying a predicate along a primary dimension, and then computes an aggregate along a secondary dimension. Prior work on identifying heavy-hitters in streams has focused almost exclusively on single-dimensional streams, and yields little insight into the properties of heavy-hitters along other dimensions. In typical applications, however, an analyst is interested not only in identifying heavy-hitters, but also in understanding further properties, such as which other items appear frequently along with a heavy-hitter, or what the frequency distribution is of the items that appear along with the heavy-hitters. We consider queries of the following form: “In a stream S of (x, y) tuples, on the substream H of all x values that are heavy-hitters, maintain those y values that occur frequently with the x values in H”. We call this problem CHH. We give an approximate formulation of CHH identification and present an algorithm for tracking CHHs on a data stream. The algorithm is easy to implement and uses workspace much smaller than the stream itself. We present provable guarantees on the maximum error, as well as detailed experimental results that demonstrate the space-accuracy trade-off.
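To make the query in the abstract concrete, the following is a minimal sketch of an exact, brute-force CHH computation in Python. It is not the algorithm from the paper: it stores a count for every distinct (x, y) pair, so its workspace can grow with the stream, which is exactly the cost the paper's approximate algorithm is designed to avoid. The function name `exact_chh` and the thresholds `phi1` and `phi2` are assumptions introduced here for illustration. An x value is treated as a heavy-hitter if it accounts for more than a `phi1` fraction of all tuples, and a y value is reported for that x if it accounts for more than a `phi2` fraction of the tuples carrying that x.

```python
from collections import Counter, defaultdict


def exact_chh(stream, phi1, phi2):
    """Exact reference computation of correlated heavy-hitters.

    stream: iterable of (x, y) tuples.
    phi1:   hypothetical threshold; x is heavy if count(x) > phi1 * n.
    phi2:   hypothetical threshold; y is reported for a heavy x
            if count(x, y) > phi2 * count(x).
    Returns {x: {y: count(x, y)}} restricted to heavy x values.
    """
    x_counts = Counter()                # frequency of each x value
    pair_counts = defaultdict(Counter)  # for each x, frequency of each y
    n = 0                               # total number of tuples seen

    for x, y in stream:
        n += 1
        x_counts[x] += 1
        pair_counts[x][y] += 1

    result = {}
    for x, fx in x_counts.items():
        if fx > phi1 * n:  # x is a heavy-hitter along the primary dimension
            result[x] = {
                y: fxy
                for y, fxy in pair_counts[x].items()
                if fxy > phi2 * fx  # y occurs frequently together with x
            }
    return result


# Toy stream of 10 tuples: "a" appears 6 times, mostly with "p".
sample = [("a", "p")] * 5 + [("a", "q")] + [("b", "p")] * 2 + [("c", "r")] * 2
print(exact_chh(sample, phi1=0.3, phi2=0.5))  # prints {'a': {'p': 5}}
```

In the toy stream, only "a" exceeds the 30% threshold on the primary dimension, and among the tuples with x = "a" only "p" exceeds the 50% threshold on the secondary dimension. An exact computation like this is mainly useful as a ground truth against which the error of a small-space approximate tracker can be measured.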

Comments

This is an Accepted Manuscript of an article published by Springer as Lahiri, Bibudh, Arko Provo Mukherjee, and Srikanta Tirthapura. "Identifying correlated heavy-hitters in a two-dimensional data stream." Data Mining and Knowledge Discovery 30, no. 4 (2016): 797-818. Available online: https://doi.org/10.1007/s10618-015-0438-6. Posted with permission.

Copyright

2016

Collections

Publications
