While most countries’ lawmakers are still discussing how to put guardrails around artificial intelligence, the European Union is ahead of the pack, having passed a risk-based framework for regulating AI apps earlier this year.
The law came into force in August, although full details of the pan-EU AI governance regime are still being worked out (Codes of Practice are in the process of being devised, for example). But, over the coming months and years, the law’s tiered provisions will start to apply to AI app and model makers, so the compliance countdown is already live and ticking.
Evaluating whether and how AI models are meeting their legal obligations is the next challenge. Large language models (LLMs), and other so-called foundation or general purpose AIs, will underpin most AI apps. So focusing assessment efforts at this layer of the AI stack seems important.
Step forward LatticeFlow AI, a spin out from public research university ETH Zurich, which is focused on AI risk management and compliance.
On Wednesday, it published what it’s touting as the first technical interpretation of the EU AI Act, meaning it’s sought to map regulatory requirements to technical ones, alongside an open-source LLM validation framework that draws on this work — which it’s calling Compl-AI (‘compl-ai’… see what they did there!).
The AI model evaluation initiative — which they also dub “the first regulation-oriented LLM benchmarking suite” — is the result of a long-term collaboration between the Swiss Federal Institute of Technology and Bulgaria’s Institute for Computer Science, Artificial Intelligence and Technology (INSAIT), per LatticeFlow.
AI model makers can use the Compl-AI site to request an evaluation of their technology’s compliance with the requirements of the EU AI Act.
LatticeFlow has also published model evaluations of several mainstream LLMs, such as different versions/sizes of Meta’s Llama models and OpenAI’s GPT, along with an EU AI Act compliance leaderboard for Big AI.
The latter ranks the performance of models from the likes of Anthropic, Google, OpenAI, Meta and Mistral against the law’s requirements — on a scale of 0 (i.e. no compliance) to 1 (full compliance).
Other evaluations are marked as N/A where there’s a lack of data, or if the model maker doesn’t make the capability available. (NB: At the time of writing there were also some minus scores recorded but we’re told that was down to a bug in the Hugging Face interface.)
LatticeFlow’s framework evaluates LLM responses across 27 benchmarks such as “toxic completions of benign text”, “prejudiced answers”, “following harmful instructions”, “truthfulness” and “common sense reasoning” to name a few of the benchmarking categories it’s using for the evaluations. So each model gets a range of scores in each column (or else N/A).
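To make that scoring scheme concrete, here is a minimal Python sketch of how one model's row could be represented and sanity-checked. It is an illustration under stated assumptions, not LatticeFlow's code: the function names are hypothetical, the numbers are placeholders rather than real leaderboard results, and `None` stands in for an N/A cell.

```python
from typing import Dict, List, Optional

# Placeholder values for illustration only; these are NOT real leaderboard results.
# Each score sits on the 0 (no compliance) to 1 (full compliance) scale.
# None stands in for an "N/A" cell.
example_scores: Dict[str, Optional[float]] = {
    "following harmful instructions": 0.9,
    "prejudiced answers": 0.8,
    "recommendation consistency": 0.3,
    "watermark reliability and robustness": None,
}


def invalid_entries(scores: Dict[str, Optional[float]]) -> List[str]:
    """Return benchmarks whose score falls outside the 0-to-1 scale (e.g. a minus score)."""
    return sorted(
        name
        for name, score in scores.items()
        if score is not None and not 0.0 <= score <= 1.0
    )


def below_halfway(scores: Dict[str, Optional[float]], threshold: float = 0.5) -> List[str]:
    """Return benchmarks that were evaluated and scored under the threshold."""
    return sorted(
        name
        for name, score in scores.items()
        if score is not None and score < threshold
    )


def coverage(scores: Dict[str, Optional[float]]) -> float:
    """Return the fraction of benchmarks that have a score at all (i.e. are not N/A)."""
    if not scores:
        return 0.0
    evaluated = sum(score is not None for score in scores.values())
    return evaluated / len(scores)


if __name__ == "__main__":
    print("Out of range:", invalid_entries(example_scores))
    print("Below halfway:", below_halfway(example_scores))
    print("Coverage:", coverage(example_scores))
```

Keeping N/A as a distinct value rather than coercing it to 0 matters here: an unevaluated benchmark is not the same as a failed one, and a check like `invalid_entries` would catch display glitches such as the minus scores mentioned above.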
AI compliance a mixed bag
So how did major LLMs do? There is no overall model score. So performance varies depending on exactly what’s being evaluated — but there are some notable highs and lows across the various benchmarks.
For example there’s strong performance for all the models on not following harmful instructions; and relatively strong performance across the board on not producing prejudiced answers — whereas reasoning and general knowledge scores were a much more mixed bag.
Elsewhere, recommendation consistency, which the framework is using as a measure of fairness, was particularly poor for all models — with none scoring above the halfway mark (and most scoring well below).
Other areas, such as training data suitability and watermark reliability and robustness, appear essentially unevaluated on account of how many results are marked N/A.
LatticeFlow does note there are certain areas where models’ compliance is more challenging to evaluate, such as hot button issues like copyright and privacy. So it’s not pretending it has all the answers.
In a paper detailing work on the framework, the scientists involved in the project highlight how most of the smaller models they evaluated (≤ 13B parameters) “scored poorly on technical robustness and safety”.
They also found that “almost all examined models struggle to achieve high levels of diversity, non-discrimination, and fairness”.
“We believe that these shortcomings are primarily due to model providers disproportionally focusing on improving model capabilities, at the expense of other important aspects highlighted by the EU AI Act’s regulatory requirements,” they add, suggesting that as compliance deadlines start to bite, LLM makers will be forced to shift their focus onto areas of concern, “leading to a more balanced development of LLMs”.
Given no one yet knows exactly what will be required to comply with the EU AI Act, LatticeFlow’s framework is necessarily a work in progress. It is also only one interpretation of how the law’s requirements could be translated into technical outputs that can be benchmarked and compared. But it’s an interesting start on what will need to be an ongoing effort to probe powerful automation technologies and try to steer their developers towards safer utility.
“The framework is a first step towards a full compliance-centered evaluation of the EU AI Act — but is designed in a way to be easily updated to move in lock-step as the Act gets updated and the various working groups make progress,” LatticeFlow CEO Petar Tsankov told TechCrunch. “The EU Commission supports this. We expect the community and industry to continue to develop the framework towards a full and comprehensive AI Act assessment platform.”
Summarizing the main takeaways so far, Tsankov said it’s clear that AI models have “predominantly been optimized for capabilities rather than compliance”. He also flagged “notable performance gaps” — pointing out that some high capability models can be on a par with weaker models when it comes to compliance.
Cyberattack resilience (at the model level) and fairness are areas of particular concern, per Tsankov, with many models scoring below 50% for the former area.
“While Anthropic and OpenAI have successfully aligned their (closed) models to score against jailbreaks and prompt injections, open-source vendors like Mistral have put less emphasis on this,” he said.
And with “most models” performing equally poorly on fairness benchmarks, he suggested this should be a priority for future work.
On the challenges of benchmarking LLM performance in areas like copyright and privacy, Tsankov explained: “For copyright the challenge is that current benchmarks only check for copyright books. This approach has two major limitations: (i) it does not account for potential copyright violations involving materials other than these specific books, and (ii) it relies on quantifying model memorization, which is notoriously difficult.
“For privacy the challenge is similar: the benchmark only attempts to determine whether the model has memorized specific personal information.”
LatticeFlow is keen for the free and open source framework to be adopted and improved by the wider AI research community.
“We invite AI researchers, developers, and regulators to join us in advancing this evolving project,” said professor Martin Vechev of ETH Zurich and founder and scientific director at INSAIT, who is also involved in the work, in a statement. “We encourage other research groups and practitioners to contribute by refining the AI Act mapping, adding new benchmarks, and expanding this open-source framework.
“The methodology can also be extended to evaluate AI models against future regulatory acts beyond the EU AI Act, making it a valuable tool for organizations working across different jurisdictions.”