Bringing ultra-large-scale software repository mining to the masses with Boa
Organizational Units
Computer Science (the theory, representation, processing, communication and use of information) is fundamentally transforming every aspect of human endeavor. The Department of Computer Science at Iowa State University advances computational and information sciences through: 1. educational and research programs within and beyond the university; 2. active engagement to help define national and international research and educational agendas; and 3. sustained commitment to graduating leaders for academia, industry and government.
History
The Computer Science Department was officially established in 1969, with Robert Stewart serving as the founding Department Chair. The faculty consisted of joint appointments with Mathematics, Statistics, and Electrical Engineering. In 1969, the building which now houses the Computer Science department, then simply called the Computer Science building, was completed. Later it was named Atanasoff Hall. From the 1980s to the present, the department has expanded and developed its teaching and research agendas to cover many areas of computing.
Dates of Existence
1969-present
Related Units
- College of Liberal Arts and Sciences (parent college)
Abstract
Mining software repositories provides developers and researchers a
chance to learn from previous development activities and apply that
knowledge to the future. Ultra-large-scale open source repositories
(e.g., SourceForge with 350,000+ projects, GitHub with 250,000+
projects, and Google Code with 250,000+ projects) provide an extremely
large corpus to perform such mining tasks on. This large corpus allows
researchers the opportunity to test new mining techniques and
empirically validate new approaches on real-world data. However, the
barrier to entry is often extremely high. Researchers interested in
mining must know a large number of techniques, languages, tools, etc.,
each of which is often complex. Additionally, mining at the scale
described above adds further complexity and is often difficult to
achieve.
The Boa language and infrastructure were developed to solve these
problems. We provide users a domain-specific language tailored for
software repository mining and allow them to submit queries via our
web-based interface. These queries are then automatically
parallelized and executed on a cluster, analyzing a dataset containing
almost 700,000 projects, history information from millions of
revisions, millions of Java source files, and billions of AST nodes.
The language also provides an easy-to-comprehend visitor syntax to
ease writing source code mining queries. The underlying
infrastructure contains several optimizations, including query
optimizations to make single queries faster as well as a fusion
optimization to group queries from multiple users into a single query.
The latter optimization is important as Boa is intended to be a
shared, community resource. Finally, we show the potential benefit of
Boa to the community by reproducing a previously published case
study and performing a new case study on the adoption of Java language
features.
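
The fusion optimization mentioned above can be illustrated with a small sketch. It is written in Python rather than Boa and does not reflect Boa's actual implementation; the Node type, the walk helper, the sample data, and the two queries are all hypothetical names introduced only for illustration. The idea shown is simply that several independent queries can share one traversal of the data instead of each traversing it separately.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Node:
    """A toy AST node (hypothetical; not Boa's data schema)."""
    kind: str
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterable[Node]:
    """Yield the node and all of its descendants in pre-order."""
    yield node
    for child in node.children:
        yield from walk(child)


# In this sketch a query inspects one node and returns a count to add.
Query = Callable[[Node], int]


def run_separately(projects: List[Node], queries: Dict[str, Query]) -> Dict[str, int]:
    """Baseline: each query traverses the whole dataset on its own."""
    return {
        name: sum(query(node) for root in projects for node in walk(root))
        for name, query in queries.items()
    }


def run_fused(projects: List[Node], queries: Dict[str, Query]) -> Dict[str, int]:
    """Fused: all queries are evaluated during a single traversal."""
    totals = {name: 0 for name in queries}
    for root in projects:
        for node in walk(root):
            for name, query in queries.items():
                totals[name] += query(node)
    return totals


# Assumed sample data: two tiny "projects" with three methods and one assert.
SAMPLE = [
    Node("File", [Node("Class", [Node("Method"), Node("Method", [Node("Assert")])])]),
    Node("File", [Node("Class", [Node("Method")])]),
]

# Two queries, as if submitted by different users.
QUERIES: Dict[str, Query] = {
    "methods": lambda n: int(n.kind == "Method"),
    "asserts": lambda n: int(n.kind == "Assert"),
}

if __name__ == "__main__":
    print(run_fused(SAMPLE, QUERIES))
```

Both functions return the same totals; the fused version reads each node once regardless of how many queries are registered, which is the property that matters when the dataset is the expensive part to scan.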