Ahmed Abdou

TL;DR (MSc Thesis): Benchmarking Large Language Models for Legal Case Classification at the European Court of Human Rights

March 2025

In recent years, artificial intelligence, particularly large language models (LLMs) like GPT-4, has transformed numerous industries by automating complex tasks. But can these powerful tools reliably classify the significance of court cases—specifically at the European Court of Human Rights (ECtHR)? This was precisely the research question I tackled in my MSc thesis at the Technical University of Munich, working in the Legal NLP lab.

Background

The European Convention on Human Rights (ECHR), established in 1950 in response to the horrors of World War II, represents a commitment to protecting fundamental human rights across Europe. The European Court of Human Rights (ECtHR) began operating in 1959 with a mandate to interpret and enforce the Convention's guarantees across all member states of the Council of Europe.

Over the decades, the Convention has expanded in both reach and influence. The Court's jurisprudence is characterized by a dynamic interpretative approach, reflecting the principle that the Convention must be interpreted in light of contemporary conditions so that its protections keep pace with evolving moral and social standards. As the Court itself emphasizes:

"What gives the Convention its strength and makes it extremely modern is the way the Court interprets it: dynamically, in the light of present-day conditions."
The Convention, a modern instrument (Page 7)

This concept of the Convention as a living instrument has allowed the Court to adapt its application to modern issues that could not have been anticipated when it was first created—challenges posed by new technologies, environmental concerns, and sensitive topics such as terrorism and migration.

The current case importance levels, as assigned in the Court's HUDOC database, are as follows:

- Key cases: judgments, decisions, and advisory opinions selected for publication in the Court's official Case Reports as the most significant contributions to case-law.
- 1 (High importance): cases that make a significant contribution to the development, clarification, or modification of the case-law.
- 2 (Medium importance): cases that go beyond merely applying existing case-law but do not make a significant contribution to it.
- 3 (Low importance): cases of limited legal interest that simply apply existing case-law.

Methodology

To explore this, I conducted extensive experiments with six state-of-the-art LLMs across three major model families:

These models were evaluated in a zero-shot scenario, meaning they were not provided with specific examples of Key Case classifications beforehand; instead, they relied solely on their general training to classify cases accurately.
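To make the setup concrete, here is a minimal sketch of a zero-shot chain-of-thought classification call, assuming an OpenAI-style chat API. The model name, prompt wording, and answer parsing are illustrative stand-ins of mine, not the thesis's exact pipeline.

```python
# Minimal zero-shot chain-of-thought classification sketch (illustrative).
# Assumptions: OpenAI-style chat API, "gpt-4" as a stand-in model name,
# and prompt wording / label parsing chosen for this example only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an expert on the case-law of the European Court of Human "
    "Rights. Decide whether the case below is a Key Case, i.e. one that "
    "makes a significant contribution to ECtHR case-law."
)

def classify_case(case_text: str) -> str:
    """Return 'KEY CASE' or 'NOT KEY CASE' for one case, zero-shot."""
    response = client.chat.completions.create(
        model="gpt-4",  # stand-in; the thesis benchmarked six different LLMs
        temperature=0.0,  # keep outputs as deterministic as possible
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": (
                    f"Case:\n{case_text}\n\n"
                    "Think step by step, then answer on the last line "
                    "with exactly 'KEY CASE' or 'NOT KEY CASE'."
                ),
            },
        ],
    )
    # Only the final line is parsed, since the preceding lines hold the
    # model's chain-of-thought reasoning.
    last_line = response.choices[0].message.content.strip().splitlines()[-1].upper()
    if "NOT KEY CASE" in last_line:
        return "NOT KEY CASE"
    if "KEY CASE" in last_line:
        return "KEY CASE"
    return "UNPARSEABLE"  # the model ignored the requested format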

Additionally, I implemented and tested advanced reasoning workflows aimed at improving the models' analytical capabilities:

Multi-Agent Debate
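As an illustration of the idea, here is a generic multi-agent debate sketch: several model instances answer independently, each then revises after reading the others' arguments, and a majority vote decides the label. The `ask_llm` placeholder, agent count, round count, and voting rule are all assumptions; the thesis's exact protocol may differ.

```python
# Generic multi-agent debate sketch (illustrative assumptions throughout:
# agent/round counts, majority-vote aggregation, and the ask_llm stub).
from collections import Counter

def ask_llm(prompt: str) -> str:
    """Placeholder for one model call, e.g. via the client shown earlier."""
    raise NotImplementedError

def debate(case_text: str, n_agents: int = 3, n_rounds: int = 2) -> str:
    question = (
        f"Case:\n{case_text}\n\n"
        "Is this a Key Case? Give a short justification and end with "
        "'KEY CASE' or 'NOT KEY CASE'."
    )
    # Round 1: every agent answers independently.
    answers = [ask_llm(question) for _ in range(n_agents)]
    # Remaining rounds: each agent sees the others' arguments and may revise.
    for _ in range(n_rounds - 1):
        answers = [
            ask_llm(
                question
                + "\n\nOther agents argued:\n---\n"
                + "\n---\n".join(a for j, a in enumerate(answers) if j != i)
                + "\n---\nReconsider and give your final answer."
            )
            for i in range(n_agents)
        ]
    # Aggregate: majority vote over the extracted labels.
    labels = [
        "NOT KEY CASE" if "NOT KEY CASE" in a.upper() else "KEY CASE"
        for a in answers
    ]
    return Counter(labels).most_common(1)[0][0]
```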

Key Findings

Model Performance Insights

Results

Workflow Enhancements

Critical Observations

The findings suggest that while models can technically handle long input contexts, they don't always benefit from them in practice.

Implications and Future Directions

My thesis underscores both the substantial potential and the critical limitations of applying LLMs in legal contexts. It highlights the urgent need for specialized methods and metrics that go beyond surface-level accuracy to assess the quality of legal reasoning.

Several promising research directions emerge:

Conclusions

This work introduced a binary classification task aimed at identifying Key Cases within ECtHR jurisprudence—a high-stakes application that pushes the boundaries of what current LLMs can handle in legal reasoning.
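On the evaluation side, a minimal scoring sketch for this binary task could look like the following. It assumes gold labels derived from HUDOC importance levels and already-collected model predictions; the toy data and the choice of scikit-learn with per-class precision/recall/F1 are mine. Because Key Cases are only a small fraction of the docket, raw accuracy is misleading: a classifier that never predicts "KEY CASE" still scores high accuracy, which is exactly why metrics beyond surface-level accuracy matter here.

```python
# Minimal evaluation sketch for the binary Key Case task (illustrative:
# the labels below are toy data, and the metric choice is an assumption).
from sklearn.metrics import classification_report

# Gold labels: Key Case vs. everything else, derived from HUDOC levels.
y_true = ["KEY CASE", "NOT KEY CASE", "NOT KEY CASE", "NOT KEY CASE"]
# Model predictions, e.g. collected from classify_case or debate above.
y_pred = ["KEY CASE", "NOT KEY CASE", "KEY CASE", "NOT KEY CASE"]

# Per-class precision/recall/F1 exposes failure on the rare positive
# class, which overall accuracy would hide.
print(classification_report(y_true, y_pred, labels=["KEY CASE", "NOT KEY CASE"]))
```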

Through zero-shot chain-of-thought prompting, several key trends emerged:

References