Dissertation

2016

Degree Name

Doctor of Philosophy

Department

Electrical and Computer Engineering

Major

Computer Engineering

Arun K. Somani

Abstract

Big data continues to grow in size for all sciences. New methods like those proposed are needed to further reduce memory footprints and distribute work equally across compute nodes both in local HPC systems and rented cluster resources in the cloud. Modern infrastructures have evolved to support these big data computations and that includes key pieces like our internet backbones and data center networks. Many optical networks face heterogeneous communication requests requiring topologies to be efficient and fault tolerant. The all-pairs problem requires all elements (computation datasets or communication nodes) to be paired with all other elements. These all-pairs problems occur in many research fields and have significant impacts, which has led to their continued interest.

We proposed using cyclic quorum sets to efficiently manage all-pairs computations. We proved these sets have an "all-pairs" property that allows for minimal data replication and for distributed, load balanced, and communication-less computation management. The quorums are $O\left(\frac{N}{\sqrt{P}}\right)$ in size, up to 50% smaller than dual $\frac{N}{\sqrt{P}}$ array implementations, and significantly smaller than solutions requiring all data. Scaling from 16 to 512 cores (1 to 32 compute nodes) and using real dataset inputs, application experiments demonstrated scalability with greater than 150x (super-linear) speedup and less than 1/4th the memory usage per process.

Cyclic quorum sets can provided benefits to more than just computations. The sets can also provide a guarantee that all pairs of optical nodes in a network can communicate. Our evaluation analyzed the fault tolerance of routing optical cycles based on cyclic quorum sets. With this method of topology construction, unicast and multicast communication requests do not need to be known or even modeled a priori. In the presence of network single-link faults, our simulated cycle routing had greater than 99% average fault coverage. Hence, even in the presence of a network fault, the optical networks could continue operation of nearly all node pair communications.

Lastly, we proposed a generalized $R$ redundant cyclic quorum set. These sets guarantee all pairs of nodes occur at least $R$ times. When applied to routing cycles in optical networks, this technique provided almost fault-tolerant communications. More importantly, when applied using only single cycles rather than the standard paired cycles, the generalized $R$ redundancy technique almost halved the necessary light-trail resources while maintaining the fault tolerance and dependability expected from cycle-based routing.

\section*{Problem Description}

Big Data in recent years has become a focal point for science and commerce. As datasets grow larger, traditional methods and algorithms are challenged on whether they are able to truly scale. This has led to phrases like, "swimming in sensors, drowning in data."

Our work addresses some of the challenges facing a particular type of big data interaction. The interaction considered requires all elements in a set to interact with all other elements in the set.

The all-pairs interaction is a general computation or communication problem that occurs frequently and can be as simple as considering the shaking of hands by all attendees to a party. More formally there is set $E_N$, where there are $N$ elements indexed $0$ to $\left(N-1\right)$.

$E_N = \left\lbrace e_0, e_1, ... , e_{N-1} \right\rbrace$

The elements in this general formulation can be simple, single communication node or single item data structures, e.g., $E_N$ could simply be all nodes in a network or be a large array of $N$ values. Or, elements can be complex data structures with many fields / values. Fields are not restricted to a single data type either, as many big data problems can rely on heterogeneous datasets.

The all-pairs interaction considers all possible pairs of elements, $\binom{N}{2}$.

$\left\lbrace \left(e_0,e_1\right), \left(e_0,e_2\right), ... , \left(e_0,e_{N-1}\right), \left(e_1,e_2\right), \left(e_1,e_3\right) , ... , \left(e_1, e_{N-1}\right) , ... , \left(e_{N-2},e_{N-1}\right) \right\rbrace$

While the simple hand shake example could be considered a symmetric interaction.

$e_i \leftrightarrow e_j , i The all-pairs interaction can be more generally represented by two separate interactions to better represent the computational or communication complexity in those problems where the all-pairs operation is not commutative. \[ e_i \rightarrow e_j, i \[ e_i \leftarrow e_j, i The computational complexity of this general algorithmic form is not daunting. \[\binom{N}{2} = \frac{\left( N-1\right) N}{2} = O\left( N^2\right)$

In fact, even for pair computations that do not have the commutative property, the complexity is unchanged. In general, polynomial $O\left(N^2\right)$ computations are considered highly computationally scalable.

When performing an all-pairs data interaction on the big data scale sizes, while the computational complexity theoretically is manageable, the data management becomes complex. The problem definition inherently requires access to the entire dataset, such that every data element can be paired and processed with every other data element in the set. When the datasets exceed a system's memory size, this presents a challenge, which our methods address.

\section*{Solution Approach}

For efficiency and distributed control, it is common in distributed systems and algorithms to group nodes into intersecting sets referred to as quorum sets. Our management techniques rely on the established quorum set theories for their efficiencies and management. We then proved an "all-pairs" property of cyclic quorum sets, which is central to guaranteeing that all-pairs of elements (nodes or data) are able to interact in the system.

The all-pairs data computation problem requires all data elements to be paired with all other data elements. These all-pairs problems occur in many science fields, which has led to their continued interest. Our research addresses the memory and computation time challenges of the general all-pairs big data interaction computations through the use of memory efficient computation management techniques. Proposed were methods using distributed computing to share the computational workload. Although the problem definition requires every data element to have access to and interact with the entire dataset, our cyclic quorum set techniques relax this restriction in distributed systems. This computation management is used to reduce memory resource requirements per node and enable big data scalability. Implementation evaluation of a large bioinformatics application demonstrated scalability on real datasets with linear and at times super-linear speedups. Reductions in memory requirements per node allowed for processing larger datasets that would not have been feasible on a single node either due to memory or time requirements.

Similar cyclic quorum set techniques were used to address efficient and fault tolerant communication routing challenges in optical networking. Cycle-based optical network routing, whether using SONET rings or p-cycles, provide the sufficient reliability in networks. Light-trails forming a cycle in the network allow broadcasts within a cycle to be used for efficient multicast communications. Using the proven all-pairs'' property of cyclic quorum sets, we could guarantee all pairs of nodes will occur in one or more quorums, so efficient, arbitrary unicast communication can occur between any two nodes. Efficient broadcasts to all network nodes are possible by a node broadcasting to all quorum cycles to which it belongs ($O\left(\sqrt{N}\right)$.) We analyzed node pair communications in networks, specifically, the fault tolerance aspects of using cyclic quorum sets to route cycles. Observed was better than 99% average single fault coverage and some node pair communications were protected by more than one cycle.

Exploiting this redundant node pair protections revealed even greater resource efficiencies. Common cycle routing techniques will use pairs of cycles to achieve both routing and fault-tolerance, which uses substantial resources and creates the potential for underutilization. Instead, when we intentionally designed cyclic quorum sets with $R$ redundant pairs of nodes and utilized the $R$ redundancy within the quorum cycles to replace the pair of cycles with just a single cycle, we saw network resource usage almost halved. Our analysis of several networks showed $R=2$ redundant single cycles had 96.60 - 99.37% single link fault coverage, while reducing resource usage by 42.9 - 47.18% on average. Increasing redundancy to $R=3$ redundant cycles maintained a 93.23 - 99.34% average fault coverage even with two simultaneous link faults and used 38.85 - 42.39% fewer resources on average.

DOI

https://doi.org/10.31274/etd-180810-4575

Cory James Kleinheksel

en

application/pdf

154 pages

COinS