Boa Meets Python: A Boa Dataset of Data Science Software in Python Language

Thumbnail Image
Date
2019-05-01
Authors
Biswas, Sumon
Islam, Md Johirul
Huang, Yijia
Rajan, Hridesh
Major Professor
Advisor
Committee Member
Journal Title
Journal ISSN
Volume Title
Publisher
Authors
Person
Rajan, Hridesh
Professor and Department Chair of Computer Science
Research Projects
Organizational Units
Organizational Unit
Journal Issue
Is Version Of
Versions
Series
Department
Computer Science
Abstract

The popularity of Python programming language has surged in recent years due to its increasing usage in Data Science. The availability of Python repositories in Github presents an opportunity for mining software repository research, e.g., suggesting the best practices in developing Data Science applications, identifying bug-patterns, recommending code enhancements, etc. To enable this research, we have created a new dataset that includes 1,558 mature Github projects that develop Python software for Data Science tasks. By analyzing the metadata and code, we have included the projects in our dataset which use a diverse set of machine learning libraries and managed by a variety of users and organizations. The dataset is made publicly available through Boa infrastructure both as a collection of raw projects as well as in a processed form that could be used for performing large scale analysis using Boa language. We also present two initial applications to demonstrate the potential of the dataset that could be leveraged by the community.

Comments

This is a pre-print, available at: http://design.cs.iastate.edu/papers/MSR-19/.

Description
Keywords
Citation
DOI
Source
Copyright
Tue Jan 01 00:00:00 UTC 2019
Collections