Applying Natural Language Processing Tools to a Student Academic Writing Corpus: How Large are Disciplinary Differences Across Science and Engineering Fields?

Thumbnail Image
Date
2017-01-01
Authors
Crossley, Scott
Russell, David
Kyle, Kristopher
Romer, Ute
Major Professor
Advisor
Committee Member
Journal Title
Journal ISSN
Volume Title
Publisher
Authors
Person
Russell, David
Professor Emeritus
Research Projects
Organizational Units
Organizational Unit
English

The Department of English seeks to provide all university students with the skills of effective communication and critical thinking, as well as imparting knowledge of literature, creative writing, linguistics, speech and technical communication to students within and outside of the department.

History
The Department of English and Speech was formed in 1939 from the merger of the Department of English and the Department of Public Speaking. In 1971 its name changed to the Department of English.

Dates of Existence
1939-present

Historical Names

  • Department of English and Speech (1939-1971)

Related Units

Journal Issue
Is Version Of
Versions
Series
Department
English
Abstract

• Background: Researchers have been working towards better understanding differences in professional disciplinary writing (e.g., Ewer & Latorre, 1969; Hu & Cao, 2015; Hyland, 2002; Hyland & Tse, 2007) for decades. Recently, research has taken important steps towards understanding disciplinary variation in student writing. Much of this research is corpus-based and focuses on lexico-grammatical features in student writing as captured in the British Academic Written English (BAWE) corpus and the Michigan Corpus of Upper-level Student Papers (MICUSP). The present study extends this work by analyzing lexical and cohesion differences among disciplines in MICUSP. Critically, we analyze not only linguistic differences in macro-disciplines (science and engineering), but also in micro-disciplines within these macro-disciplines (biology, physics, industrial engineering, and mechanical engineering).

• Literature Review: Hardy and Römer (2013) used a multidimensional analysis to investigate linguistic differences across four macro-disciplines represented in MICUSP. Durrant (2014, in press) analyzed vocabulary in texts produced by student writers in the BAWE corpus by discipline and level (year) and disciplinary differences in lexical bundles. Ward (2007) examined lexical differences within micro-disciplines of a single discipline.

• Research Questions: The research questions that guide this study are as follows:

1. Are there significant lexical and cohesive differences between science and engineering student writing? 2. Are there significant lexical and cohesive differences between micro-disciplines within science and engineering student writing?

• Research Methodology: To address the research questions, student-produced science and engineering texts from MICUSP were analyzed with regard to lexical sophistication and textual features of cohesion. Specifically, 22 indices of lexical sophistication calculated by the Tool for the Automatic Analysis of Lexical Sophistication (TAALES; Kyle & Crossley, 2015) and 38 cohesion indices calculated by the Tool for the Automatic Analysis of Cohesion (TAACO; Crossley, Kyle, & McNamara, 2016) were used. These features were then compared both across science and engineering texts (addressing Research Question 1) and across micro-disciplines within science and engineering (biology and physics, industrial and mechanical engineering) using discriminate function analyses (DFA).

• Results: The DFAs revealed significant linguistic differences, not only between student writing in the two macro-disciplines but also between the micro-disciplines. Differences in classification accuracy based on students’ years of study hovered at about 10%. An analysis of accuracies of classification by paper type found they were similar for larger and smaller sample sizes, providing some indication that paper type was not a confounding variable in classification accuracy.

• Discussion: The findings provide strong support that macro-disciplinary and micro-disciplinary differences exist in student writing in these MICUSP samples and that these differences are likely not related to student level or paper type. These findings have important implications for understanding disciplinary differences. First, they confirm previous research that found the vocabulary used by different macro-disciplines to be “strikingly diverse” (Durrant, 2015), but they also show a remarkable diversity of cohesion features. The findings suggest that the common understanding of the STEM disciplines as “close” bears reconsideration in linguistic terms. Second, the lexical and cohesion differences between micro-disciplines are large enough and consistent enough to suggest that each micro-discipline can be thought of as containing a unique linguistic profile of features. Third, the differences discerned in the NLP analysis are evident at least as early as the final year of undergraduate study, suggesting that students at this level already have a solid understanding of the conventions of the disciplines of which they are aspiring to be members. Moreover, the differences are relatively homogeneous across levels, which confirms findings by Durrant (2015) but, importantly, extends these findings to include cohesion markers.

• Conclusions: The findings from this study provide evidence that macro-disciplinary and micro-disciplinary differences at the linguistic level exist in student writing, not only in lexical use but also in text cohesion. A number of pedagogical applications of writing analytics are proposed based on the reported findings from TAALES and TAACO. Further studies using different corpora (e.g., BAWE) or purpose assembled corpora are suggested to address limitations in the size and range of text types found within MICUSP. This study also points the way toward studies of disciplinary differences using NLP approaches that capture data which goes beyond the lexical and cohesive features of text, including the use of part-of-speech tags, syntactic parsing, indices related to syntactic complexity and similarity, rhetorical features, or more advanced cohesion metrics (latent semantic analysis, latent Dirichlet allocation, Word2Vec approaches).

Comments

This article is published as Crossley, Scott, David Russell, Kristopher Kyle, and Ute Römer. "Applying Natural Language Processing Tools to a Student Academic Writing Corpus: How Large are Disciplinary Differences Across Science and Engineering Fields?." Journal of Writing Analytics 1 (2017).

Description
Keywords
Citation
DOI
Source
Copyright
Sun Jan 01 00:00:00 UTC 2017
Collections