Authors: Ayrton San Joaquin (AISL), Rokas Gipiškis (AISL), Leon Staufer (TU Munich), and Ariel Gil (AISL)
As AI models rapidly advance, benchmarks remain critical for evaluating capabilities and safety. However, many benchmarks persist despite becoming outdated, flawed, or misaligned with their intended purpose, inflating performance metrics and potentially obscuring safety concerns.
This paper introduces comprehensive criteria for determining when benchmarks should be deprecated, that is, retired from use. It also proposes a three-phase deprecation framework encompassing assessment, reporting, and notification, with provisions for both partial deprecation (i.e., updating the benchmark) and full deprecation.
Recognising that benchmark developers may lack incentives to deprecate their own work, the framework permits third-party deprecation by governance actors such as government agencies and industry panels. The paper also provides detailed implementation guidance, including a blueprint for the EU AI Office to create deprecation lists for safety-critical benchmarks.
