TEXT will develop techniques for evaluating LLM performance on complex real-world scenarios and will fine-tune models, using alignment techniques, so that they excel on these 'LLM-Hard' problems.
This work package focuses on developing techniques to evaluate large language models on complex real-world scenarios and on refining models through alignment strategies to improve their performance on challenging tasks (i.e., tasks that require models to learn non-differentiable properties). By addressing current limitations in evaluation practices, this research will contribute to more robust and human-aligned AI systems.
Most existing evaluation methods assess AI performance on relatively simple tasks that fail to capture the full range of capabilities required for advanced, real-world applications. These 'LLM-Easy' evaluations focus on text generation, question-answering, and keyword-based search, but they do not test deeper contextual understanding, creativity, or adaptability. This work package challenges this approach by advocating for 'LLM-Hard' evaluations that require nuanced reasoning, long-text generation, ontology learning, and cultural value alignment.
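The contrast between the two evaluation regimes can be made concrete with a minimal sketch. Both checkers below are illustrative assumptions, not the work package's actual benchmarks: a keyword-overlap test stands in for LLM-Easy evaluation, and a toy proxy for sustained, on-stance long-form text stands in for the kind of criteria LLM-Hard evaluation would demand.

```python
# Sketch contrasting an 'LLM-Easy' check with an 'LLM-Hard' one.
# Both functions are hypothetical illustrations, not project APIs.

def llm_easy_check(answer: str, gold_keywords: set[str]) -> bool:
    """Keyword overlap: passes if every gold keyword appears in the answer."""
    return all(k.lower() in answer.lower() for k in gold_keywords)

def llm_hard_check(answer: str, min_sentences: int, required_stance: str) -> bool:
    """Toy proxy for nuanced criteria: sustained, on-stance long-form text.
    Real LLM-Hard evaluation would require human or rubric-based judgement."""
    sentences = [s for s in answer.split(".") if s.strip()]
    return len(sentences) >= min_sentences and required_stance.lower() in answer.lower()

answer = "The poem praises duty."
print(llm_easy_check(answer, {"duty"}))          # True: keyword match suffices
print(llm_hard_check(answer, 3, "ambivalence"))  # False: lacks length and stance
```

The point of the sketch is that a short answer can pass a surface-level check while failing any criterion that probes depth, which is why LLM-Hard evaluation cannot be reduced to pattern matching.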
To bridge the gap between conceptual research and technical application, the work package translates insights from creative AI studies into rigorous benchmarks and evaluation protocols. By refining models to handle complex linguistic and contextual challenges, the research team will contribute to more meaningful AI-assisted writing, decision-making, and cultural interpretation.
The research will be structured around four key tasks: 1) developing new benchmarks for LLM-Hard problems, 2) evaluating model performance using these benchmarks, 3) designing an alignment framework to ensure AI models meet human-centered criteria, and 4) fine-tuning and retraining models to enhance their ability to engage with real-world complexities. The project will also emphasize human-in-the-loop strategies, ensuring that model development aligns with human expectations and values.
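The first two tasks above, building benchmarks and evaluating models against them, can be sketched as a small evaluation loop. All names here (`BenchmarkItem`, the rubric weights, the keyword scorer) are hypothetical placeholders assumed for illustration; the project's real benchmarks would pair such a loop with human-in-the-loop rubric scoring rather than keyword matching.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class BenchmarkItem:
    """One hypothetical LLM-Hard benchmark item: a prompt plus a
    weighted rubric (criterion -> weight, weights summing to 1.0)."""
    prompt: str
    rubric: dict[str, float]

def evaluate(model: Callable[[str], str],
             items: list[BenchmarkItem],
             score: Callable[[str, str], float]) -> float:
    """Average weighted rubric score across all benchmark items."""
    total = 0.0
    for item in items:
        output = model(item.prompt)
        total += sum(w * score(output, crit) for crit, w in item.rubric.items())
    return total / len(items)

# Placeholder scorer: checks whether the criterion word appears in the
# output. A real protocol would substitute human or model-based judgement.
def keyword_score(output: str, criterion: str) -> float:
    return 1.0 if criterion.lower() in output.lower() else 0.0

items = [BenchmarkItem("Summarise the saga's moral themes.",
                       {"honour": 0.5, "fate": 0.5})]
toy_model = lambda prompt: "Themes of honour and fate dominate."
print(evaluate(toy_model, items, keyword_score))  # 1.0
```

Separating the model, the items, and the scoring function keeps the loop reusable: the alignment and fine-tuning tasks (3 and 4) can plug retrained models and stricter human-centered scorers into the same protocol.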
This work package is housed at the Center for Humanities Computing at Aarhus University and collaborates with Danish Foundation Models. The team consists of researchers, engineers, and project managers working together to advance AI evaluation and deployment. The ultimate goal is to ensure that AI models not only perform well on structured tasks but also adapt meaningfully to complex, dynamic environments.
Directed by Kristoffer L. Nielbo