Automated Writing Evaluation: Methods and Shortcomings

Dave Frame
Jan 10, 2021 · 9 min read

When I taught English, I was concerned, in general, with the place and function of technology in first-year composition classrooms. More specifically, I was looking at how feedback is received online and how computer interfaces affect student uptake and perceptions of feedback. However, this line of inquiry led me to some interesting sources on automated writing evaluations (AWEs), computer programs that aim to provide varying levels of feedback and/or holistic scoring of student work.

As an FYC instructor, the promise of less time spent on mechanics, leaving more time for higher-level concerns, was alluring. I narrowed the scope of my research to focus on the potential role of AWEs, with a mind toward supplementing my feedback to students. However, research in this field leaves questions that demand further attention. Further research and refinement of these relatively new technologies may eventually help instructors to provide deeper, more thorough feedback on student work.

AWE is a broad category that applies to many different types of software, with many different features, methods, and aims. Two basic applications emerge from the research: machine scoring and computer-generated feedback. Machine scoring of writing (MSW) is a form of assessment increasingly common in standardized testing, in which a computer program provides a holistic test score based on an algorithmic evaluation of student writing.

Well-known tests like the Graduate Management Admission Test (GMAT) incorporate MSW by evaluating student responses with one human reader and one machine reader, as opposed to the two human readers that were once standard (McCurry). MSW is a type of summative assessment, tabulating a grade. Computer-generated feedback, meanwhile, provided by commercial software like spelling and grammar checkers, or the more robust proofreading software of recent years, can be used by students directly as part of the revision process, allowing them to respond and improve their writing.

Claims about the reliability of MSW date back to the early 2000s. The agreement rate between “e-rater” and human readers on GMAT essays is said to be 98 percent (McCurry). Huot et al., however, lament in their history of writing assessment that “the concept of reliability ha[s] continued to focus primarily on interrater reliability, the ability of two readers to give the same score for the same piece of writing.” They point out that such reliability is a prerequisite for, but not a guarantee of, “validity,” a more recent consideration in composition pedagogy, concerned with “the correlation of scores on a test with some other objective measure of that which the test is used to measure” (Huot et al.). The focus on reliability ignores the question of how much such assessments say about students, their progress, or their potential.
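
To see what an agreement rate does and does not capture, here is a minimal sketch using made-up holistic scores; the score lists, scale, and tolerance are hypothetical, not drawn from McCurry or Huot et al.

```python
# Toy illustration: interrater "reliability" as simple agreement between two
# sets of holistic scores. High agreement shows the raters behave alike; it
# says nothing about whether the score reflects good writing (validity).

def agreement_rate(scores_a, scores_b, tolerance=0):
    """Fraction of essays where the two raters' scores differ by at most `tolerance`."""
    assert len(scores_a) == len(scores_b)
    matches = sum(1 for a, b in zip(scores_a, scores_b) if abs(a - b) <= tolerance)
    return matches / len(scores_a)

# Hypothetical 1-6 holistic scores for ten essays.
human   = [4, 5, 3, 6, 2, 4, 5, 3, 4, 5]
machine = [4, 5, 3, 5, 2, 4, 5, 3, 4, 5]

print(agreement_rate(human, machine))               # exact agreement: 0.9
print(agreement_rate(human, machine, tolerance=1))  # adjacent agreement: 1.0
```

Both raters could just as easily be agreeing on essay length; the number alone says nothing about what is being measured, which is the point Huot et al. press about validity.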

This problem of reliability versus validity has been a staple of testing debates for nearly a century, even with human scorers. Now the same question is leveled at MSW: can it measure what it aims to measure? Can a machine evaluate writing in a way that addresses the complex concerns of writing instructors? In “Toward an Artful Critique of Reform: Responding to Standards, Assessment, and Machine Scoring,” Webber raises this question. He discusses opposition to machine scoring in the context of broader political reforms of education, but he also provides an example of the kinds of concerns instructors have with MSW. The Human Readers petition, produced in response to the Common Core, answered no: machines cannot “‘measure the essentials of effective written communication: accuracy, reasoning, adequacy of evidence, good sense, ethical stance, convincing argument, meaningful organization, clarity, and veracity, among others’… because its methodology is ‘trivial,’ ‘reductive,’ ‘inaccurate,’ ‘undiagnostic,’ ‘unfair,’ and ‘secretive’” (qtd. in Webber).

This seems to echo McCurry’s findings in “Can machine scoring deal with broad and open writing tests as well as human readers?” which demonstrated that, while MSW was reliable in scoring “specific and constrained writing tasks,” it was unreliable in scoring more “broad and open writing tasks,” where scorers needed to consider meaning and understand arguments. There seems to be some consensus among my sources that MSW limits the type of evaluation possible and fails to address higher-level concerns that require interpretation of meaning. This makes it unpopular among compositionists, who see it as a means of displacing their professional judgment (Webber).

The greatest problem facing MSW, and AWEs more generally, is how easily they can be gamed. AWEs can only measure; they cannot understand. Webber quotes Les Perelman’s Boston Globe article “Flunk the Robo-Graders”: “Robo-graders do not score by understanding meaning but almost solely by the use of gross measures, especially length and the presence of pretentious language,” which can result in high scores for such nonsense as “According to professor of theory of knowledge Leon Trotsky, privacy is the most fundamental report of humankind. Radiation on advocates to an orator transmits gamma rays of parsimony to implode.” So, while MSW can reliably agree with human scorers, it has little or no validity. Without human oversight to verify meaning, computers can’t evaluate significant features of strong writing like the organization of ideas or sound argument.
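
To make Perelman’s point concrete, consider a deliberately crude scorer built only on the kind of “gross measures” he describes. This is a toy sketch, not a description of e-rater or any real engine; the features, weights, and thresholds are invented for illustration.

```python
# A crude "robo-grader" built only on gross surface measures: essay length and
# the share of long ("pretentious") words. The features never touch meaning,
# so fluent nonsense can outscore plain, coherent prose.

def surface_score(essay: str) -> float:
    words = essay.split()
    if not words:
        return 0.0
    length_score = min(len(words) / 500, 1.0)            # reward sheer length, capped
    big_words = sum(1 for w in words if len(w.strip('.,;:!?')) >= 9)
    vocab_score = min(big_words / len(words) * 5, 1.0)   # reward long-word density
    return round(3 * length_score + 3 * vocab_score, 2)  # a 0-6 "holistic" score

nonsense = ("According to professor of theory of knowledge Leon Trotsky, privacy is "
            "the most fundamental report of humankind. Radiation on advocates to an "
            "orator transmits gamma rays of parsimony to implode. ") * 20
plain = "Privacy matters because people need space to think and act freely. " * 20

print(surface_score(nonsense))  # 6.0 -- top marks for meaningless prose
print(surface_score(plain))     # 1.32 -- coherent but short and plainly worded
```

Length and word length stand in here for whatever surface features a given engine actually uses; the point is that any score computed from features blind to meaning can be inflated simply by producing more of those features.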

Instructors and researchers have not, however, totally denounced AWEs. The “gross measures” they do address can still be useful information for students and instructors. This gives rise to a discussion of the second application of AWEs: computer-generated feedback. Tools as simple and familiar as spelling and grammar checkers have gained acceptance for their ability to direct writers’ attention to surface-level errors in their texts. In recent years, more sophisticated AWEs have become commercially available, aiming to measure factors like overuse of words, variation of sentence length, and use of passive voice. Some instructors and researchers have begun to grapple with the efficacy of these new technologies as tools in the classroom. Ware, for instance, cites a “resounding consensus about computer-generated feedback, among developers and writing specialists alike…that the time is ripe for critically examining its potential use as a supplement to writing instruction, not as a replacement.”

Potter and Fuller’s article “My New Teaching Partner? Using the Grammar Checker in Writing Instruction” provides some insight into the strengths of AWEs in the classroom, as well as a method for instructors to leverage their shortcomings: instructors must teach students to be discerning with the software’s advice. They explain:

Vernon recommends teaching the checker’s limitations and how students might work with these (336), including activities where learners respond to grammar-check recommendations in small groups, make corrections on highlighted errors without the help of computer suggestions, create sentences to trigger the grammar checker or fool it, and compare rules in the grammar checker to rules in the grammar handbook (346).

In this view, even reductive and inaccurate feedback can be a tool for teaching students both mechanics and agency over a text. “For instance,” they write, “my students need someone to explain why the powerful grammar checker does not correct such sentences as ‘Little Women were a great book,’ or ‘The cows or the pig find the grass’” (Potter and Fuller).
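
To see why a rule-based checker might wave such sentences through, here is a minimal sketch of a naive agreement check. The lexicon, rules, and function names are invented for illustration; they are not the actual logic of Microsoft Word’s checker, Grammarly, or any product discussed in these studies.

```python
# A toy, rule-based subject-verb agreement check of the kind a naive grammar
# checker might apply (purely hypothetical). It looks up the head noun's number
# in a lexicon and compares it with the verb, so "Little Women were..." passes:
# 'women' is plural in the lexicon, and the checker has no idea the phrase
# names a single book.

NOUN_NUMBER = {'women': 'plural', 'woman': 'singular',
               'cows': 'plural', 'pig': 'singular', 'book': 'singular'}
VERB_NUMBER = {'were': 'plural', 'was': 'singular',
               'find': 'plural', 'finds': 'singular'}

def naive_agreement_ok(head_noun: str, verb: str) -> bool:
    """Flag nothing unless the lexicon says noun and verb numbers clash."""
    noun_num = NOUN_NUMBER.get(head_noun.lower())
    verb_num = VERB_NUMBER.get(verb.lower())
    if noun_num is None or verb_num is None:
        return True  # unknown word: stay silent
    return noun_num == verb_num

# "Little Women were a great book": 'women' and 'were' are both plural -> no
# flag, even though the title refers to one book and needs 'was'.
print(naive_agreement_ok('Women', 'were'))   # True (error goes uncaught)

# "The cows or the pig find the grass": if the checker keys on 'cows', the
# plural verb passes, though with 'or' agreement should follow the nearer
# noun ('the pig finds').
print(naive_agreement_ok('cows', 'find'))    # True (error goes uncaught)
print(naive_agreement_ok('pig', 'finds'))    # True (the correct form)
```

The gap this sketch exposes is exactly what Potter and Fuller exploit in class: the checker’s silence is not evidence of correctness, and explaining why it stays silent is itself a grammar lesson.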

However, AWEs must be more than a foil for teachers’ greater competence and understanding of context. El-Ebyary and Scott, Zhe, and Bond and Pennebaker all found in their studies that students do heed advice given by AWEs. In particular, El-Ebyary and Scott and Zhe both found that students who used AWEs as a tool during the writing process, as opposed to for scoring, reported that the feedback led them to place more value on revision and prompted them to produce more drafts. This finding must, of course, be balanced against Ware’s finding that the programs’ algorithmic evaluation tends to highlight surface-level errors, which may distract from instructors’ higher-level concerns. El-Ebyary and Scott ultimately conclude, in seeming agreement with Ware, that these programs are best used as assistants in conjunction with instructor feedback, meaning the programs still require human oversight and fail to address higher-level concerns about voice, structure, and rhetorical sophistication.

Similarly, Dembsey, who evaluated the commercial software Grammarly®, addresses Ware’s concern with the shortcomings of AWEs: namely, that AWEs are still no match for humans when offering feedback on higher-level concerns. Dembsey looks at Grammarly’s potential to supplement writing center tutoring and finds that it is unable to address structure and meaning. Dembsey and Ware share another finding: teachers and students had positive things to say about interface and user-friendliness, but there was less agreement on accuracy and consistency. In fact, they find, along with Zhe, that the algorithmic nature of AWEs tends to produce instances of incorrect grammar advice. Zhe offers that this likely has to do with the machines’ misinterpretation of the grammatical function of certain words because of nuances like homonymy.

Zhe also found that the student in her case study was able to evaluate the computer-generated feedback and use her own judgment to decide whether to adopt its suggestions. This suggests, in conjunction with the finding that computer-generated feedback encourages revision, that interacting with AWEs has the potential to challenge students’ understanding and increase their sense of agency over their text. Ware found that the effect computer-generated feedback has on student writing “depends largely on how writing is defined and how [feedback] is implemented.” In her own study, she found significant improvement in writing scores after 90 minutes of use per week for six weeks. However, she indicates that other studies with less structure had doubtful results. Thus, she recommends emphasizing long-term commitment to a program, using “writing assistance tools” over scoring, and “balanc[ing] provision of teacher, peer, and computer-generated feedback.”

The sources I’ve examined look at several different programs of varying sophistication, including Potter and Fuller’s use of word-processor grammar checkers, Bond and Pennebaker’s limited analysis of pronoun usage in expressive writing, Zhe’s Chinese ESL assessment software, and the more commonly discussed, commercially available MY Access!®, Criterion®, and Grammarly®. A further problem, not with the AWEs themselves but with the research about them, is that the novelty of the technology, along with teachers’ differences in perceptions, expectations, and applications, creates a problem of methodology. Studies are extremely difficult to generalize because they rely on small samples of participants willing to experiment with the new technology (in Zhe’s case, a single student). Furthermore, Ware notes that teachers are given broad freedom in how they implement different programs, which are themselves disparate in methods, aims, and outcomes. With no single program evaluated in any single way, it becomes very difficult to determine what effect a given program or method of implementation might have when practiced more broadly.

Whether looking at MSW with Webber and McCurry or computer-generated feedback with Ware, Zhe, and Potter and Fuller, one major finding on AWEs rings clear: they understand nothing. No AWE now in use is capable of commenting on meaning, reason, structure, or argument. To be reliable, they must be given narrowly constrained writing tasks. They rely on reductive, mechanical metrics of grammar and syntax that can be fooled or be flatly mistaken. This calls into question the validity of their measures in scoring tests, as well as the efficacy of their feedback to students. However, scholars like Zhe and Potter and Fuller are examining the ways educators can take full advantage of the limited scope of AWEs, even leveraging their shortcomings. Ultimately, Ware warns that “over the long term, effects on the more observable mechanistic and formulaic aspects of writing may be counterproductive if they lead teachers and students further away from writing purposefully for real audiences,” so it’s important that instructors continue to question the ways these tools can serve higher-level goals in writing. Further study will be necessary; the research, like the tools themselves, is still in its infancy. But I can agree with Ware and Webber that the best place for compositionists is at the table, working with programmers and policymakers to effect changes to our metrics for meaningful improvement in student writing.
