Campus Units

Electrical and Computer Engineering

Document Type

Article

Publication Version

Published Version

Publication Date

12-2017

Journal or Book Title

Journal of Big Data

Volume

4

Issue

18

DOI

10.1186/s40537-017-0080-9

Abstract

The performance gap between compute and storage is fairly considerable. This results in a mismatch between the application needs from storage and what storage can deliver. The full potential of storage devices cannot be harnessed till all layers of I/O hierarchy function efficiently. Despite advanced optimizations applied across various layers along the odyssey of data access, the I/O stack still remains volatile. The problems associated due to the inefficiencies in data management get amplified in Big Data shared resource environments. The Linux OS (host) block layer is the most critical part of the I/O hierarchy, as it orchestrates the I/O requests from different applications to the underlying storage. Unfortunately, despite it’s significance, the block layer, essentially the block I/O scheduler, hasn’t evolved to meet the needs of Big Data. We have designed and developed two contention avoidance storage solutions, collectively known as “BID: Bulk I/O Dispatch” in the Linux block layer specifically to suit multitenant, multi-tasking shared Big Data environments. Hard disk drives (HDDs) form the backbone of data center storage. The data access time in HDDs is majorly governed by disk arm movements, which usually occurs when data is not accessed sequentially. Big Data applications exhibit evident sequentiality but due to the contentions amongst other I/O submitting applications, the I/O accesses get multiplexed which leads to higher disk arm movements. BID schemes aim to exploit the inherent I/O sequentiality of Big Data applications to improve the overall I/O completion time by reducing the avoidable disk arm movements. In the first part, we propose a dynamically adaptable block I/O scheduling scheme BID-HDD for disk based storage. BID-HDD tries to recreate the sequentiality in I/O access in order to provide performance isolation to each I/O submitting process. Through trace driven simulation based experiments with cloud emulating MapReduce benchmarks, we show the effectiveness of BID-HDD which results in 28–52% lesser time for all I/O requests than the best performing Linux disk schedulers. In the second part, we propose a hybrid scheme BID-Hybrid to exploit SCM’s (SSDs) superior random performance to further avoid contentions at disk based storage. BID-Hybrid is able to efficiently offload non-bulky interruptions from HDD request queue to SSD queue using BID-HDD for disk request processing and multi-q FIFO architecture for SSD. This results in performance gain of 6–23% for MapReduce workloads when compared to BID-HDD and 33–54% over best performing Linux scheduling scheme. BID schemes as a whole is aimed to avoid contentions for disk based storage I/Os following system constraints without compromising SLAs.

Comments

This article is published as Mishra, Pratik, and Arun K. Somani. "Host managed contention avoidance storage solutions for Big Data." Journal of Big Data 4, no. 1 (2017): 18. DOI: 10.1186/s40537-017-0080-9. Posted with permission.

Creative Commons License

Creative Commons Attribution 4.0 License
This work is licensed under a Creative Commons Attribution 4.0 License.

Copyright Owner

The Authors

Language

en

File Format

application/pdf

Share

COinS