Range‐Efficient Counting of Distinct Elements in a Massive Data Stream

Thumbnail Image
Date
2007-05-01
Authors
Pavan, A.
Tirthapura, Srikanta
Major Professor
Advisor
Committee Member
Journal Title
Journal ISSN
Volume Title
Publisher
Authors
Person
Research Projects
Organizational Units
Organizational Unit
Organizational Unit
Journal Issue
Is Version Of
Versions
Series
Department
Computer ScienceElectrical and Computer Engineering
Abstract

Efficient one‐pass estimation of $F_0$, the number of distinct elements in a data stream, is a fundamental problem arising in various contexts in databases and networking. We consider range‐efficient estimation of $F_0$: estimation of the number of distinct elements in a data stream where each element of the stream is not just a single integer but an interval of integers. We present a randomized algorithm which yields an (ε, δ)‐approximation of $F_0$, with the following time and space complexities (n is the size of the universe of the items): (1) The amortized processing time per interval is $O(\log{\frac{1}{\delta}}\log \frac{n}{\epsilon})$. (2) The workspace used is $O(\frac{1}{\epsilon^2}\log{\frac{1}{\delta}}\log n)$ bits. Our algorithm improves upon a previous algorithm by Bar‐Yossef, Kumar and Sivakumar [Proceedings of the $13$th ACM–SIAM Symposium on Discrete Algorithms (SODA), 2002, pp. 623–632], which requires $O(\frac{1}{\epsilon^5} \log{\frac{1}{\delta}}\log^5 n)$ processing time per item. This algorithm can also be used to compute the max‐dominance norm of a stream of multiple signals and significantly improves upon the previous best time and space bounds by Cormode and Muthukrishnan [Proceedings of the $11$th European Symposium on Algorithms (ESA), Lecture Notes in Comput. Sci. 2938, Springer, Berlin, 2003, pp. 148–160]. This algorithm also provides an efficient solution to the distinct summation problem, which arises during data aggregation in sensor networks [Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems, ACM Press, New York, 2004, pp. 250–262, Proceedings of the $20$th International Conference on Data Engineering (ICDE), 2004, pp. 449–460].

Comments

This article is from SIAM Journal on Computing 37 (2007): 359, doi: 10.1137/050643672. Posted with permission.

Description
Keywords
Citation
DOI
Copyright
Mon Jan 01 00:00:00 UTC 2007
Collections