Toward efficient online scheduling for large-scale distributed machine learning system
Organizational Units
Computer Science (the theory, representation, processing, communication and use of information) is fundamentally transforming every aspect of human endeavor. The Department of Computer Science at Iowa State University advances computational and information sciences through: 1. educational and research programs within and beyond the university; 2. active engagement to help define national and international research and educational agendas; and 3. sustained commitment to graduating leaders for academia, industry and government.
History
The Computer Science Department was officially established in 1969, with Robert Stewart serving as the founding Department Chair. The faculty consisted of joint appointments with Mathematics, Statistics, and Electrical Engineering. In 1969, the building that now houses the Computer Science department, then simply called the Computer Science building, was completed. It was later named Atanasoff Hall. From the 1980s to the present, the department has expanded and developed its teaching and research agendas to cover many areas of computing.
Dates of Existence
1969-present
Related Units
- College of Liberal Arts and Sciences (parent college)
Abstract
Thanks to the rise of machine learning (ML) and its wide range of applications, recent years have witnessed rapid growth of large-scale distributed ML frameworks, which exploit the massive parallelism of computing clusters to expedite ML training jobs. However, the proliferation of these frameworks also introduces many unique technical challenges in computing system design and optimization. In a networked computing cluster that supports a large number of training jobs, a central question is how to design efficient scheduling algorithms that allocate workers and parameter servers across different machines to minimize the overall training time. Toward this end, in this paper, we develop an online scheduling algorithm that jointly optimizes resource allocation and locality decisions. Our main contributions are three-fold: i) We develop a new analytical model that considers both resource allocation and locality; ii) Based on an equivalent reformulation and close observation of the worker-parameter server locality configurations, we transform the problem into a mixed cover/packing integer program, which enables approximation algorithm design; iii) We propose a meticulously designed randomized rounding approximation algorithm and rigorously prove its performance. Collectively, our results contribute to a comprehensive and fundamental understanding of distributed ML system optimization and algorithm design.
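To make the idea of randomized rounding for a mixed cover/packing integer program concrete, the following is a minimal illustrative sketch, not the algorithm developed in this work. It assumes a fractional allocation (for example, from a linear programming relaxation) is already available, rounds each entry up with probability equal to its fractional part, and retries until all packing constraints (capacity, a . x <= b) and cover constraints (demand, c . x >= d) hold. The function names, the retry loop, and the toy instance are all assumptions made for illustration.

```python
import math
import random


def round_once(x_frac, rng):
    """Round each value up with probability equal to its fractional part."""
    rounded = []
    for value in x_frac:
        base = math.floor(value)
        rounded.append(base + (1 if rng.random() < value - base else 0))
    return rounded


def is_feasible(x, packing, cover):
    """Check packing rows (a . x <= b) and cover rows (c . x >= d)."""
    def dot(row):
        return sum(coef * xi for coef, xi in zip(row, x))

    return (all(dot(a) <= b for a, b in packing)
            and all(dot(c) >= d for c, d in cover))


def randomized_rounding(x_frac, packing, cover, max_tries=1000, seed=0):
    """Repeat independent roundings until one is feasible; None if none found."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        x = round_once(x_frac, rng)
        if is_feasible(x, packing, cover):
            return x
    return None


if __name__ == "__main__":
    # Hypothetical toy instance: workers placed on two machines.
    x_frac = [1.5, 2.5]                      # assumed fractional solution
    packing = [([1, 0], 2), ([0, 1], 3)]     # per-machine capacity limits
    cover = [([1, 1], 4)]                    # job needs at least 4 workers
    print(randomized_rounding(x_frac, packing, cover))
```

In this toy instance, three of the four possible roundings satisfy every constraint, so a feasible integer allocation is found quickly. A real approximation algorithm would additionally bound the probability of constraint violation and the loss in objective value, which this sketch does not attempt.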