Campus Units

Electrical and Computer Engineering, Computer Science

Document Type

Conference Proceeding

Conference

22nd International Conference on Extending Database Technology (EDBT)

Publication Version

Published Version

Publication Date

2019

Journal or Book Title

Proceedings of the 22nd International Conference on Extending Database Technology (EDBT)

Conference Title

22nd International Conference on Extending Database Technology (EDBT)

Conference Date

March 26-29, 2019

City

Lisbon, Portugal

Abstract

Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams, and make the following contributions. We present a lower bound that shows that any streaming algorithm for SRS must have (in the worst case) a variance that is Ω(r ) factor away from the optimal, where r is the number of strata. We present S-VOILA, a streaming algorithm for SRS that is locally variance-optimal. Results from experiments on real and synthetic data show that S-VOILA results in a variance that is typically close to an optimal offline algorithm, which was given the entire input beforehand. We also present a variance-optimal offline algorithm VOILA for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation, which is optimal only under the assumption that each stratum is abundant, i.e. has a large number of data points to choose from. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data.

Comments

This proceeding is published as Nguyen, Trong Duc, Ming-Hung Shih, Divesh Srivastava, Srikanta Tirthapura, and Bojian Xu. "Stratified Random Sampling from Streaming and Stored Data." Proceedings of the 22nd International Conference on Extending Database Technology (EDBT). Lisbon, Portugal: EDBT/ICDT 2019 Joint Conference. March 26-29, 2019. Posted with permission.

Creative Commons License

Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.

Copyright Owner

The Authors

Language

en

File Format

application/pdf

Share

Article Location

 
COinS