Deprecating Benchmarks: Criteria and Framework

Published on July 8, 2025

Authors: Ayrton San Joaquin (AISL), Rokas Gipiškis (AISL), Leon Staufer (TU Munich), and Ariel Gil (AISL)

As AI models rapidly advance, benchmarks remain critical for evaluating capabilities and safety. However, many benchmarks persist despite becoming outdated, flawed, or misaligned with their intended purpose, inflating performance metrics and potentially obscuring safety concerns.

This paper introduces comprehensive criteria for determining when benchmarks should be deprecated, that is, retired from use. It also proposes a three-phase deprecation framework encompassing assessment, reporting, and notification, with provisions for both partial deprecation (i.e. updating the benchmark) and full deprecation.

Recognising that benchmark developers may lack incentives to deprecate their own work, the framework permits third-party deprecation by governance actors such as government agencies and industry panels. The paper also provides detailed implementation guidance, including a blueprint for the EU AI Office to create deprecation lists for safety-critical benchmarks.


Open Problems in AI Incident Governance

July 7, 2026
NIST AI 800-2: Our Recommendations on Benchmark Lifecycle and Deprecation

April 9, 2026
Recommendations for the EU AI Act Digital Omnibus Trilogue

April 7, 2026

AI Standards Lab, Inc. is a Delaware nonprofit corporation operating through a fiscal sponsorship with Players Philanthropy Fund, Inc., a Texas nonprofit corporation recognized by the IRS as a tax-exempt public charity under Section 501(c)(3) of the Internal Revenue Code (Federal Tax ID: 27-6601178, ppf.org/pp). Contributions to AI Standards Lab qualify as tax-deductible to the fullest extent of the law.

EU Transparency Register: 060933793069-24
