Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
WhatOpen-source LLM evaluation framework with YAML configs, JSONL datasets, and built-in plus model-graded scoring.
WhoOpenAI (March 2023 launch alongside GPT-4).
2026 UseCapability eval with a large community-contributed catalogue; agentic eval is weaker.
Repositorygithub.com/openai/evals
Section V.v Tools|Last verified April 2026

OpenAI Evals: 400+ Public Evals, How to Add One in 30 Lines

The first open-source LLM eval framework with a real community. Still the easiest place to fork a starter eval.

I

The Shape of an Eval

An OpenAI Evals run consists of a YAML config naming the eval and pointing to a JSONL of samples, a completion function specifying the model under test, and a scorer specifying how to grade outputs. The framework iterates samples, runs the completion function on each, applies the scorer, and reports aggregate metrics. Logs are written in a structured format that can be replayed.

II

Community Catalogue

The evals/registry directory in the public repository contains over 400 evals contributed by the community. Examples include US legal exam questions, French verb conjugation, medical diagnosis from clinical vignettes, code-completion edge cases, jailbreak resistance probes. The quality varies. A few high-traffic evals (logic puzzles, basic arithmetic) are well-curated; many one-off contributions have small sample sizes or ambiguous rubrics. Treat the catalogue as a starting set, not a published benchmark.

III

Where It Fits in 2026

OpenAI Evals is still actively maintained and useful for spinning up a custom capability eval quickly. For frontier safety evaluation, Inspect has overtaken it. For multi-provider production observability, LangSmith, Langfuse, and Braintrust are stronger. OpenAI Evals's niche in 2026 is the place to prototype a one-off eval before promoting it to a more production-grade framework.

Inspect (UK AISI)HELMAll eval tools compared
Reader Questions
Q.01What is OpenAI Evals?+
OpenAI Evals is the open-source LLM evaluation framework that OpenAI released in March 2023 alongside the GPT-4 launch. The framework runs an LLM against a JSONL dataset, scores each output, and reports aggregate metrics. It supports built-in scorers (exact match, fuzzy match, regex) and model-graded scoring (a second LLM judges the first). The repository ships 400+ community-contributed evals.
Q.02Is it tied to OpenAI models?+
It defaults to OpenAI models but is provider-agnostic through the completion-function adapter. Several community forks add Anthropic, HuggingFace, and Together AI providers. The native dependency on the OpenAI SDK is the primary reason most teams running on multi-provider stacks prefer LangSmith, Inspect, or LMEval instead.
Q.03How do I add my own eval?+
Write a JSONL of {input, ideal} pairs, write a YAML config that names the eval and points to the data and the scorer, and run oaieval. A simple multiple-choice eval is roughly 30 lines including YAML. Model-graded scorers add another 10 to 30 lines depending on whether you customise the rubric. The framework is intentionally minimal.
Q.04What is model-graded scoring and when does it fail?+
Model-graded scoring uses a second LLM to judge whether the first LLM's answer is correct. It fails when the judge model shares biases with the candidate model (both prefer their own writing style), when the rubric is ambiguous (different judge prompts produce different scores), or when the candidate hits a refusal that the judge interprets as failure. Best practice is to use a stronger model as the judge than the candidate, and to spot-check judge labels against human annotation.
Q.05How does it compare to Inspect?+
OpenAI Evals is older, more YAML-driven, and has a larger archive of community evals. Inspect (UK AISI) is more Python-native, has stronger agentic and sandboxing primitives, and is more actively used by frontier safety teams. For capability evaluation on a fixed multiple-choice benchmark, either works. For agentic evaluation with tool use and sandbox, Inspect is the stronger fit.

Sources

  1. [1] OpenAI Evals repository: github.com/openai/evals
  2. [2] OpenAI Evals launch (Mar 2023): openai.com/index/gpt-4-research
  3. [3] Evals documentation: github.com/openai/evals/docs/build-eval.md
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.

Book a 30-min scoping callDigital Signet →

30 minutes, free, independent.·1-page action plan within 48h.·Honest if not the right fit.