Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
What: Open-source Python eval framework for LLMs and agents; MIT-licensed.
Who: UK AI Safety Institute (UK AISI), 2024 release.
2026 Use: Frontier safety eval standard; widely used by AISIs and frontier labs.
Project: inspect.ai-safety-institute.org.uk
Section V.iv Tools | Last verified April 2026

Inspect by UK AISI: The Eval Framework Behind Sonnet 4.7 Safety Tests

The framework that became the de facto standard for AISI-grade safety evaluation in 18 months.

I

Architecture

Inspect organises an evaluation around three abstractions. A dataset is a collection of samples (input plus expected target). A solver is the strategy the model uses to produce an answer (one-shot, multi-shot, chain-of-thought, tool-use agent). A scorer evaluates the produced answer against the target (exact match, model-graded, custom function). All three are pluggable Python objects.
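
The three abstractions can be sketched in plain Python. The block below is a minimal illustration and not Inspect's API: `Sample`, `one_shot`, `chain_of_thought`, `exact_match`, `run_eval`, and `stub_model` are hypothetical names chosen for this example, and the model is a stub so the code runs offline.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical names throughout: this mirrors the dataset / solver / scorer
# split in plain Python and does not import or reproduce the inspect_ai API.


@dataclass(frozen=True)
class Sample:
    input: str   # prompt shown to the model
    target: str  # expected answer


Model = Callable[[str], str]             # prompt -> completion
Solver = Callable[[Model, Sample], str]  # strategy for producing an answer
Scorer = Callable[[str, Sample], float]  # answer vs. target -> score


def one_shot(model: Model, sample: Sample) -> str:
    """Solver: send the input once and return the raw completion."""
    return model(sample.input)


def chain_of_thought(model: Model, sample: Sample) -> str:
    """Solver: ask for reasoning first, then keep only the last line."""
    prompt = sample.input + "\nThink step by step, then answer on the last line."
    lines = model(prompt).strip().splitlines()
    return lines[-1] if lines else ""


def exact_match(answer: str, sample: Sample) -> float:
    """Scorer: 1.0 when answer equals target, ignoring case and padding."""
    return 1.0 if answer.strip().lower() == sample.target.strip().lower() else 0.0


def run_eval(dataset: list[Sample], solver: Solver, scorer: Scorer, model: Model) -> float:
    """Run every sample through the solver and return the mean score."""
    scores = [scorer(solver(model, sample), sample) for sample in dataset]
    return sum(scores) / len(scores)


def stub_model(prompt: str) -> str:
    """Stand-in for a real provider call; always answers 'Paris'."""
    return "Paris"


dataset = [
    Sample(input="Capital of France?", target="paris"),
    Sample(input="Capital of Japan?", target="Tokyo"),
]
accuracy = run_eval(dataset, one_shot, exact_match, stub_model)
```

Swapping `one_shot` for `chain_of_thought`, or `exact_match` for a different scoring function, changes one argument to `run_eval` and nothing else, which is the property the pluggable design is after.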

This is a deliberately small surface. A simple multiple-choice eval is a 20-line Python file. A complex agentic eval with sandboxed code execution and tool calls is a few hundred lines but uses the same abstractions. UK AISI chose this design because it had to support evaluations across capability, safety, and agentic dimensions without forking the framework per category.

II

Why It Spread

Inspect spread quickly across the frontier-lab ecosystem because it solved three pain points. First, audit-trail logging: every eval run produces a structured log that captures the prompt, the model response, the score, and the metadata. Second, provider portability: the same eval runs against any model with a one-line provider switch. Third, sandboxing: Inspect ships a Docker-based sandbox for agentic evaluations that need to run untrusted code without compromising the host.
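
The first two points can be made concrete with a short sketch. It assumes a hypothetical record layout and provider registry: `RunRecord`, `PROVIDERS`, `run_sample`, `to_log_line`, and the `provider_a/stub` and `provider_b/stub` labels are illustrative, and none of them is Inspect's actual log schema or provider interface.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Callable

# Hypothetical sketch: field names, provider labels, and the registry are
# illustrative and are not Inspect's log schema or provider interface.


@dataclass
class RunRecord:
    model: str
    prompt: str
    response: str
    score: float
    metadata: dict


# Each provider is a function from prompt to completion. These stubs stand in
# for real API clients so the example runs offline.
PROVIDERS: dict[str, Callable[[str], str]] = {
    "provider_a/stub": lambda prompt: "4",
    "provider_b/stub": lambda prompt: "five",
}


def run_sample(model_name: str, prompt: str, target: str) -> RunRecord:
    """Call the selected provider and capture what an audit would need."""
    response = PROVIDERS[model_name](prompt)
    return RunRecord(
        model=model_name,
        prompt=prompt,
        response=response,
        score=1.0 if response.strip() == target else 0.0,
        metadata={"timestamp": datetime.now(timezone.utc).isoformat()},
    )


def to_log_line(record: RunRecord) -> str:
    """Serialise one record as a JSON line suitable for appending to a log."""
    return json.dumps(asdict(record), sort_keys=True)


# Switching providers is a one-line change to the model name.
record_a = run_sample("provider_a/stub", "What is 2 + 2?", "4")
record_b = run_sample("provider_b/stub", "What is 2 + 2?", "4")
log_lines = [to_log_line(record_a), to_log_line(record_b)]
```

Because every record carries the prompt, response, score, and metadata together, a reviewer can re-score or re-run a sample from the log alone without access to the original script.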

III

Limitations

Inspect is Python-first and assumes the team running it can read and write Python. Teams that prefer a no-code or YAML-only eval definition will find OpenAI Evals or Promptfoo more accessible. Inspect's sandboxing is also Docker-dependent, which makes it harder to run on machines without Docker (some macOS setups, some restricted CI environments).
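
One way to cope with the Docker dependency is a preflight check that skips sandboxed evals on hosts without Docker. The sketch below is an assumption about how a team might wire this up and is not part of Inspect: `docker_available`, `runnable_tasks`, and the task names are hypothetical.

```python
import shutil

# Hypothetical helper names; this is a preflight pattern, not an Inspect feature.


def docker_available() -> bool:
    """Return True when a `docker` executable is on PATH.

    This does not confirm that the Docker daemon is running.
    """
    return shutil.which("docker") is not None


def runnable_tasks(tasks: dict[str, bool], has_docker: bool) -> list[str]:
    """Filter task names by sandbox requirement.

    `tasks` maps a task name to True when that task needs a sandbox.
    Sandboxed tasks are kept only when Docker is available.
    """
    return [name for name, needs_sandbox in tasks.items() if has_docker or not needs_sandbox]


tasks = {"multiple_choice": False, "agentic_ctf": True}
selected = runnable_tasks(tasks, has_docker=docker_available())
```

On a restricted CI runner this keeps the non-sandboxed evals running and leaves the agentic ones to a host where Docker is installed.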

Reader Questions
Q.01 What is Inspect?
Inspect is an open-source LLM evaluation framework developed by the UK AI Safety Institute (UK AISI) and released under the MIT licence in 2024. It is the framework UK AISI uses internally for its pre-deployment safety testing of frontier models including Sonnet 4.x, GPT-4o, GPT-5, Gemini 2, and Gemini 3. It is written in Python, ships a CLI and a programmatic API, and integrates with Hugging Face, OpenAI, Anthropic, Google, and Together AI.
Q.02 How is Inspect different from OpenAI Evals?
Inspect is designed for capability and safety evaluation, including agentic tasks with tool use, sandboxed code execution, and multi-turn dialogues. OpenAI Evals is more focused on single-turn correctness and is tied to OpenAI infrastructure idioms. Inspect's solver and tool abstractions are more general and easier to extend to agentic settings.
Q.03 Is Inspect tied to UK AISI models or evaluations?
No. Inspect is fully open-source and provider-agnostic. UK AISI ships some of its own published evaluations (cyber, biology, autonomy) as Inspect solver packages, but the framework itself runs against any model and any evaluation. UK AISI's choice to open-source the framework rather than keep it proprietary was a deliberate ecosystem move.
Q.04 When should I pick Inspect over LMEval or OpenAI Evals?
Pick Inspect when (a) you need agentic evaluation with tool use and sandboxing, (b) you want a single framework for capability and safety, or (c) you care about audit-trail logging (Inspect produces structured run logs that can be used to replicate a run). Pick LMEval (lm-evaluation-harness) when you are doing pure capability eval on multiple-choice benchmarks. Pick OpenAI Evals when you are running inside OpenAI's ecosystem and care primarily about correctness.
Q.05 Who else uses Inspect besides UK AISI?
By mid-2025, Anthropic, Google DeepMind, Meta, and the US AI Safety Institute were all running Inspect for at least some of their internal safety evaluations. The ecosystem of third-party Inspect-compatible eval packages includes METR (autonomy evaluations), Apollo Research (scheming evaluations), and academic teams across the UK, US, and EU.

Sources

  [1] Inspect documentation: inspect.ai-safety-institute.org.uk
  [2] Inspect repository: github.com/UKGovernmentBEIS/inspect_ai
  [3] UK AISI launch post: aisi.gov.uk/work/inspect
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
