AI in University Assessment: Evaluating the Opportunities and Risks of Automated Marking

Generative Artificial Intelligence (AI) has disrupted the entire education sector. The ai@cam OpRaise project focuses on the ability of AI, particularly Large Language Models (LLMs), to evaluate students’ work, especially their long-form responses to open-ended questions. Many students report using LLMs to seek feedback on essays, including in high-stakes situations. The research question was whether AI-generated numerical feedback is sufficiently robust to support students and educators. The project contextualised this evidence by considering the views of stakeholders on the broader opportunities and risks of integrating AI systems into university assessment practices.

Key Findings

  • AI marking accuracy was moderate at best: degree band agreement ranged from 35% to 63% across universities, below the threshold needed for confident deployment.
  • AI systems showed a central tendency bias, compressing marks toward the middle and performing worst at grade boundaries and for the highest- and lowest-performing students.
  • AI marks were oversensitive to surface features like essay length and vocabulary range, rather than the quality of academic reasoning that human markers prioritise.
  • Performance varied significantly across institutions, meaning results from one context cannot be used as evidence of readiness elsewhere.
  • Students and staff view human contact and judgement as fundamental to the social contract of higher education — many students said they would feel “cheated” if AI marked their work.
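
To make the first two findings concrete, the sketch below shows one way degree band agreement and central tendency compression might be measured on a set of paired marks. It is an illustration only, not the project's method: the band boundaries assume the common UK classification (First at 70 and above, 2:1 at 60, 2:2 at 50, Third at 40), the function names are hypothetical, and the marks are invented for the example rather than taken from the study.

```python
from statistics import pstdev

# Assumed UK-style band boundaries as (lower bound, label).
# These are an assumption for illustration, not taken from the project.
BANDS = [(70, "First"), (60, "2:1"), (50, "2:2"), (40, "Third"), (0, "Fail")]


def band(mark):
    """Map a percentage mark to its degree band."""
    for lower, label in BANDS:
        if mark >= lower:
            return label
    raise ValueError("mark must be non-negative")


def band_agreement(human, ai):
    """Share of essays where the AI mark falls in the same band as the human mark."""
    if len(human) != len(ai) or not human:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(band(h) == band(a) for h, a in zip(human, ai))
    return matches / len(human)


def spread_ratio(human, ai):
    """Ratio of AI to human standard deviation; below 1 suggests compressed marks."""
    return pstdev(ai) / pstdev(human)


# Invented marks for illustration only (not project data).
human_marks = [38, 52, 58, 64, 68, 75]
ai_marks = [55, 58, 61, 63, 62, 66]

print(f"band agreement: {band_agreement(human_marks, ai_marks):.0%}")
print(f"spread ratio:   {spread_ratio(human_marks, ai_marks):.2f}")
```

In this invented sample the AI marks sit in a narrower range than the human marks, so the spread ratio falls below 1, and the disagreements occur at the lowest mark, the highest mark, and one mark near a band boundary.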

Recommendations

  • Proceed with caution — current AI systems are not accurate or valid enough for formal assessment use.
  • Evaluate AI locally before any deployment, using the institution's own materials, cohort, and marking practices.
  • Preserve human authority over final marks in all scenarios — AI should support human judgement, not replace it.
  • Engage staff and students openly before any adoption, building trust through transparency and concrete discussion of specific use cases.
  • Monitor continuously — AI model performance is unstable over time and institution-specific, making ongoing review essential.