Degree Type

Thesis

Date of Award

2019

Degree Name

Master of Science

Department

Computer Science

Major

Computer Science

First Advisor

Jia Liu

Abstract

Thanks to the rise of machine learning (ML) and its vast applications, recent years have witnessed a rapid growth of large-scale distributed ML frameworks, which exploit the massive parallelism of computing clusters to expedite ML training jobs. However, the proliferation of large-scale distributed ML frameworks also introduces many unique technical challenges in computing system design and optimization. In a networked computing cluster that supports a large number of training jobs, a central question is how to design efficient scheduling algorithms to allocate workers and parameter servers across different machines to minimize the overall training time. Toward this end, in this paper, we develop an online scheduling algorithm that jointly optimizes resource allocation and locality decisions. Our main contributions are three-fold: i) We develop a new analytical model that considers both resource allocation and locality; ii) Based on an equivalent reformulation and careful observations of the worker-parameter-server locality configurations, we transform the problem into a mixed cover/packing integer program, which enables approximation algorithm design; iii) We propose a carefully designed randomized rounding approximation algorithm and rigorously prove its performance. Collectively, our results contribute to a comprehensive and fundamental understanding of distributed ML system optimization and algorithm design.
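Contribution iii) refers to a randomized rounding approximation algorithm for the mixed cover/packing reformulation; the thesis itself develops that algorithm and its analysis. The snippet below is only a generic, self-contained sketch of the randomized-rounding idea applied to a toy cover/packing integer program (the problem data, the scaling factor alpha, and the retry limit are all hypothetical), using SciPy's LP solver for the fractional relaxation.

```python
"""Illustrative sketch (not the thesis's algorithm): randomized rounding of an
LP relaxation for a small mixed cover/packing integer program.

  minimize    c^T x
  subject to  A_cover x >= b_cover   (covering constraints)
              A_pack  x <= b_pack    (packing constraints)
              x in {0, 1}
"""
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Toy problem data (hypothetical): 4 binary decision variables.
c = np.array([3.0, 2.0, 4.0, 1.0])      # objective costs
A_cover = np.array([[1, 1, 0, 1],       # demands that must be covered
                    [0, 1, 1, 1]])
b_cover = np.array([1, 1])
A_pack = np.array([[1, 1, 1, 1]])       # capacity-style constraint
b_pack = np.array([3])

# Step 1: solve the LP relaxation (0 <= x <= 1).
# linprog minimizes c^T x subject to A_ub x <= b_ub, so covering rows are negated.
A_ub = np.vstack([-A_cover, A_pack])
b_ub = np.concatenate([-b_cover, b_pack])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * len(c), method="highs")
x_frac = res.x

# Step 2: randomized rounding with a scaling factor alpha, retried until the
# covering constraints are satisfied. A full analysis would bound both the
# expected cost and the packing-constraint violation; this sketch does not.
alpha = 1.5
x_int = np.zeros(len(c), dtype=int)
for attempt in range(100):
    probs = np.minimum(1.0, alpha * x_frac)
    x_int = (rng.random(len(c)) < probs).astype(int)
    if np.all(A_cover @ x_int >= b_cover):
        break

print("LP relaxation:   ", np.round(x_frac, 3))
print("Rounded solution:", x_int, "cost:", float(c @ x_int))
```

In the scheduling setting described by the abstract, the binary variables would correspond to worker/parameter-server placement decisions, covering constraints to job demands, and packing constraints to machine capacities; the mapping here is illustrative only.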

Copyright Owner

Menglu Yu

Language

en

File Format

application/pdf

Number of Pages

39 pages

Available for download on Sunday, January 24, 2021
