Random sampling in Apache Hive

Date
2019-01-01
Authors
Parvathaneni, Sai Sree
Major Professor
Srikanta Tirthapura
Department
Electrical and Computer Engineering
Abstract

Data generated by humans and machines is growing at a rapid pace. Analyzing this data reveals trends, patterns, and useful insights that help organizations make important decisions. Traditional database systems have stored and analyzed large amounts of data for many decades, but handling and analyzing ever-growing data in traditional databases demands substantial resources and time: reading and writing large data sets from a single disk is significantly slow, and storing data across multiple disks and combining it for analysis on a single CPU is also impractical for huge amounts of data. The problem of storing and analyzing large amounts of data is addressed by Apache Hadoop, a collection of open-source big data software that stores large amounts of data efficiently by dividing it into small blocks and replicating them to tolerate system failures; the data is then analyzed through parallel computation. Hive is a data warehousing system that runs on top of the Hadoop Distributed File System (HDFS). It provides the HiveQL interface for executing queries, which are automatically converted into MapReduce, Tez, or Spark jobs. For aggregate queries such as AVG, SUM, and COUNT, and for analyzing trends in data, sampling gives a good approximation of the overall data, and analyzing a sample of the population can be done with a limited amount of resources. There are different techniques for drawing a sample from a population, and the choice of technique depends on the type of analysis to be performed. In this thesis, we investigate three techniques for performing random sampling in Hive: simple random sampling using sorting, Bernoulli sampling, and our algorithm, random sampling using bucketing. Simple random sampling using sorting and Bernoulli sampling both scan the entire table to draw a sample, which slows down performance when the data is huge. To avoid a whole-table scan while performing simple random sampling, our algorithm uses bucketing in the Hive architecture to organize the data stored on HDFS. Bucketing divides the data into a specified number of small blocks based on a specified column of the table, and allows a bucket of the required size to be selected without scanning the whole table. Limiting the data scanned and sorting fewer elements decreases the time taken to perform simple random sampling using bucketing. Our experiments show that random sampling using bucketing performs much faster than random sampling using sorting or Bernoulli sampling when the data or sample sizes are large.
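
To make the three techniques concrete, the HiveQL sketches below illustrate each approach. The table name web_logs, the column user_id, the bucket count, and the sample sizes are illustrative assumptions rather than details taken from the thesis, and the bucketing query is a minimal sketch of the general idea, not the exact algorithm developed here.

-- Simple random sampling using sorting: tag every row with a random
-- key, sort globally, and keep the first n rows. This scans and sorts
-- the whole table.
SELECT *
FROM web_logs
ORDER BY rand()
LIMIT 1000;

-- Bernoulli sampling: keep each row independently with probability p
-- (here p = 0.01). This still scans the whole table, and the sample
-- size is only approximately p times the row count.
SELECT *
FROM web_logs
WHERE rand() <= 0.01;

-- Random sampling using bucketing: declare the table bucketed so Hive
-- hashes rows on user_id into 64 files on HDFS at load time.
-- (On Hive versions before 2.0, SET hive.enforce.bucketing = true;
-- is needed before inserting for the layout to be enforced.)
CREATE TABLE web_logs_bucketed (user_id BIGINT, url STRING)
CLUSTERED BY (user_id) INTO 64 BUCKETS;

-- A sample can then be drawn by reading a single bucket, roughly 1/64
-- of the data, without scanning the whole table.
SELECT *
FROM web_logs_bucketed
TABLESAMPLE (BUCKET 1 OUT OF 64 ON user_id);

Because the bucket layout is fixed when the data is loaded, only the rows inside the chosen bucket need to be read and, if a smaller sample is required, sorted, which is where the speedup over the full-scan methods comes from.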

Copyright
2019