Campus Units

Computer Science, Sociology

Document Type


Publication Version

Submitted Manuscript

Publication Date


Journal or Book Title

Social Science Research Network


The emerging discipline of data science has rapidly evolved into an extremely diverse field equipped with multi-disciplinary techniques to extract, analyze, and classify structured and unstructured data. These methods offer researchers, policy analysts, and the lay public evidence-based insights into a tremendous range of human, organizational, and societal activities on a scale and scope that have rarely been possible with conventional scientific methods. At present, however, the multi-disciplinary nature of the data science space suffers from a ‘language’ problem insofar as data scientists from different fields often use different terms to describe common methods and concepts. The aim of the present research is threefold. First, we report the results of a literature review that identifies and defines the essential content domain of data science, with special focus on the classification of data collection techniques. Second, we establish a preliminary set of relationships among the most trafficked terms of data science to facilitate interdisciplinary communication among scientists from heterogeneous fields. Third, we develop a classification scheme of web-scraping methods based on their availability, the quality of the data each method procures, the ease of data extraction, reproducibility, the technical skills required to leverage each method, and the types of data each method collects.


This is a pre-print made available through the Social Science Research Network.

Copyright Owner

The Authors



File Format