Campus Units

Computer Science, Sociology

Document Type

Article

Publication Version

Submitted Manuscript

Publication Date

2017

Journal or Book Title

Social Science Research Network

Abstract

The emerging field of data science has rapidly evolved into an extremely diverse discipline equipped with multi-disciplinary techniques to extract, analyze, and classify structured and unstructured data. These methods offer researchers, policy analysts, and the lay public evidence-based insights into a tremendous range of human, organizational, and societal activities at a scale and scope rarely possible with conventional scientific methods. At present, however, the multi-disciplinary nature of the data science space suffers from a ‘language’ problem insofar as data scientists from different fields often use different terms to describe common methods and concepts. The aim of the present research is threefold. First, we report results of a literature review that identifies and defines the essential content domain of data science, with special focus on the classification of data collection techniques. Second, we establish a preliminary set of relationships among the most trafficked terms of data science to facilitate interdisciplinary communication among scientists from heterogeneous fields. Third, we develop a classification scheme of web-scraping methods based on their availability, the quality of the data procured by the method, the ease of data extraction, reproducibility, the technical skills required to leverage each method, and the types of data collected by each method.

Comments

This is a pre-print made available through Social Science Research Network: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2920842.

Copyright Owner

The Authors

Language

en

File Format

application/pdf
