Dissertation Defense
Beyond Benchmarks: Human-Aligned Evaluation Frameworks for Large Language Models
This event is free and open to the public

Hybrid Event: 3725 BBB / Zoom
Abstract: The rapid advancement of Large Language Models (LLMs), such as OpenAI’s GPT-4, has revolutionized natural language processing. Despite their impressive capabilities, evaluating LLMs remains a significant challenge, particularly in tasks requiring alignment with human preferences. Traditional automated metrics such as BLEU and ROUGE fail to capture nuanced human judgments, while methods like reinforcement learning from human feedback (RLHF) are limited in scope. These limitations are especially pronounced in open-ended, real-world tasks where user satisfaction and cultural fidelity are critical.
Our work addresses these challenges by developing and validating a comprehensive framework that redefines how LLM performance is evaluated. First, it establishes a formal framework for ranking-based human evaluation of LLMs, identifying key properties such as transitivity, prediction accuracy, and adaptability to subjective, context-dependent tasks. Second, it designs and validates novel automated evaluation techniques that integrate generative models with small samples of human judgments, achieving scalability while maintaining robust alignment with human preferences. Finally, it applies these methodologies to diverse use cases, including subjective productivity tasks such as generating progress-driven motivational content and translation for low-resource languages, where the new evaluation methods prove effective at assessing linguistic and cultural fidelity.
This work bridges the gap between automated and human evaluation, offering scalable, interpretable, and human-aligned solutions to the challenges of LLM evaluation. By addressing ranking, automation, and application-specific needs, it enhances the reliability of LLM performance assessments and provides actionable insights for deploying LLMs effectively in real-world settings.