
The complete framework for secure benchmark authoring and enforcement

SecureBench pairs published standards for benchmark packs with a runtime that enforces visibility lanes, sandboxed harnesses, and trusted verification. Authors get a clear path to a first pack; reviewers can trace how structure becomes behavior at run time.

Why benchmark security needs shared defaults

Teams need both guidance and enforcement: explicit contracts for repo patches and terminal workspaces, and a runtime that executes those defaults instead of trusting a single-process runner.

  • No shared safe defaults: Authors invent pack layout, visibility rules, and harness wiring from scratch, with no common baseline for what “safe” looks like.
  • Family rules differ: Repository patches and terminal workspaces need explicit contracts for what agents may change and what verifiers may trust.
  • Standards stay on paper: Even when authors document intent, monolithic runners do not enforce separation between public task data, candidate production, and verification.
  • Execution inherits risk: Unrestricted shells, shared filesystems, and co-located secrets turn a poorly structured pack into an unauditable run.

For benchmark creators: author repo patch or terminal task packs; the compiler turns manifest, row, and asset structure into enforced visibility lanes at run time.

For security reviewers: inspect the active family contracts, then trace how the framework materializes resources, isolates candidate production, and runs verification from trusted code paths.

Contracts for agentic benchmark packs

Authors design against repo patch and terminal task contracts (visibility maps, asset roots, harness policy) instead of reinventing security rules for every benchmark.

Agentic contracts
The active repo patch and terminal task contracts define row schemas, eval visibility, and authoring security notes.
Pack structure
Manifest, JSONL, and asset roots encode public, evaluation, and hidden material with defaults that steer authors away from unsafe layouts.
Tester YAML standard
Harness selection, verification policy, and egress rules live in a shared schema, not a one-off per benchmark.
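
As a rough illustration of what a shared tester schema buys over one-off wiring, the sketch below validates a parsed tester config and defaults egress to deny. The key names (harness, verification, egress.allowlist) and the classes are assumptions made for this example; the canonical schema lives in the SecureBench repo and may differ.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EgressPolicy:
    # Hypothetical shape: an empty allowlist means no network access at all.
    allowlist: tuple[str, ...] = ()

    def allows(self, host: str) -> bool:
        return host in self.allowlist


@dataclass(frozen=True)
class TesterConfig:
    # Field names are illustrative, not the published tester YAML keys.
    harness: str
    verification: str
    egress: EgressPolicy = field(default_factory=EgressPolicy)


KNOWN_KEYS = {"harness", "verification", "egress"}


def load_tester_config(raw: dict) -> TesterConfig:
    """Build a config from an already-parsed mapping, rejecting unknown keys."""
    unknown = set(raw) - KNOWN_KEYS
    if unknown:
        raise ValueError(f"unknown tester keys: {sorted(unknown)}")
    hosts = tuple(raw.get("egress", {}).get("allowlist", ()))
    return TesterConfig(
        harness=raw["harness"],
        verification=raw["verification"],
        egress=EgressPolicy(allowlist=hosts),
    )
```

Rejecting unknown keys and denying egress unless a host is listed keeps a typo or omission from silently widening what a run may reach.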

Canonical standards for tester YAML and security policy ship in the SecureBench repo. The Benchmark Packs and Tester Configuration guides expand on them here with architecture context.

Documentation

Three guides cover the first run, pack authoring, and security review. Use the rest for architecture, reference, and extension.

Current pipeline: agentic benchmark packs and tester YAML with JSONL output (candidates.jsonl). Active families are repo_patch and terminal_task.
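
For readers unfamiliar with JSONL, the sketch below writes and reads one JSON object per line, the general shape of a file such as candidates.jsonl. The record fields (task_id, family, candidate) are hypothetical; only the two family names come from this page.

```python
import io
import json


def write_candidates(records, stream):
    """Write one JSON object per line (JSONL)."""
    for record in records:
        stream.write(json.dumps(record, sort_keys=True) + "\n")


def read_candidates(stream):
    """Read JSONL back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


# Hypothetical records; the real output schema is defined by SecureBench.
records = [
    {"task_id": "t1", "family": "repo_patch", "candidate": "diff --git ..."},
    {"task_id": "t2", "family": "terminal_task", "candidate": "workspace.tar"},
]

buffer = io.StringIO()
write_candidates(records, buffer)
buffer.seek(0)
loaded = read_candidates(buffer)
```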

Runtime enforcement at compile time and on disk

The framework compiles contract-shaped packs into SecureBenchTask objects, runs candidate production in sandboxed harnesses, and verifies from trusted paths. Visibility lanes are enforced when tasks compile and when resources materialize, not only described in docs.
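
One way to picture compile-time enforcement is a function that sorts every row field into a lane according to a per-family visibility map and refuses any field the map does not declare. The sketch below is a simplified stand-in, not the actual compiler: CompiledTask substitutes for SecureBenchTask, and the field names in VISIBILITY are invented for illustration.

```python
from dataclasses import dataclass

LANES = ("public", "evaluation_inputs", "hidden")

# Hypothetical visibility maps; the real family contracts define the fields.
VISIBILITY = {
    "repo_patch": {
        "prompt": "public",
        "tests": "evaluation_inputs",
        "gold_patch": "hidden",
    },
    "terminal_task": {
        "instructions": "public",
        "checker": "evaluation_inputs",
        "expected_state": "hidden",
    },
}


@dataclass(frozen=True)
class CompiledTask:
    family: str
    lanes: dict  # lane name -> {field name: value}


def compile_row(family: str, row: dict) -> CompiledTask:
    """Assign each row field to a lane; fail on anything undeclared."""
    if family not in VISIBILITY:
        raise ValueError(f"unknown family: {family!r}")
    visibility = VISIBILITY[family]
    lanes = {lane: {} for lane in LANES}
    for key, value in row.items():
        if key not in visibility:
            # Failing closed: an undeclared field never defaults to public.
            raise ValueError(f"field {key!r} has no declared lane")
        lanes[visibility[key]][key] = value
    return CompiledTask(family=family, lanes=lanes)
```

Because the check runs when the task is built, a mislabeled or undeclared field stops the run before any agent payload exists.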

How typical benchmark runners differ from SecureBench
Typical approach | SecureBench
Each benchmark invents its own layout and security posture | Agentic family contracts define safe pack, row, and tester defaults
Runners execute tasks however authors wired them | Compiler maps eval fields to visibility lanes per family contract
One process reads prompts and scores answers | Separate harness (producer) and verifier phases
Shell and network access are often unrestricted | Docker sandboxes default to no network; optional allowlisted egress

Visibility lanes

  • public: Prompts, instructions, and assets the harness may expose to candidate production.
  • evaluation_inputs: Tests and checkers staged for verification; never included in agent payloads.
  • hidden: Gold patches, expected state, and trusted analysis data reserved for verifier code paths.

Agent phase

  • Public payload only
  • Harness sandbox with path policy
  • Candidate extraction, no verifier scripts

Verifier phase

  • Hidden and evaluation inputs available
  • Fresh verifier sandbox per check
  • Trusted checker code writes the score
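
The split between the two phases can be sketched as two calls that receive different views of the same task. This is a minimal in-process illustration under assumed names (agent_payload, run_task, the gold field); it shows only which data each phase is handed, not the sandboxing the runtime performs.

```python
def agent_payload(lanes: dict) -> dict:
    """Only the public lane is ever handed to candidate production."""
    return dict(lanes["public"])


def run_task(lanes: dict, produce, check):
    """Run the producer on public data, then score with a trusted checker."""
    candidate = produce(agent_payload(lanes))
    # The verifier view is assembled separately and never passed to produce().
    verifier_view = {
        "evaluation_inputs": dict(lanes["evaluation_inputs"]),
        "hidden": dict(lanes["hidden"]),
    }
    return check(candidate, verifier_view)


# Hypothetical task data and callables for demonstration.
lanes = {
    "public": {"prompt": "return 2 + 2"},
    "evaluation_inputs": {"tests": ["assert answer == 4"]},
    "hidden": {"gold": 4},
}
seen_by_agent = []


def produce(payload):
    seen_by_agent.extend(payload)  # record which keys the agent could see
    return 4


def check(candidate, view):
    return 1.0 if candidate == view["hidden"]["gold"] else 0.0


score = run_task(lanes, produce, check)
```

Here the score comes from check alone, so a producer cannot report its own result, which mirrors the point that trusted checker code writes the score.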

Run a smoke benchmark, then deepen the standards

Start with Getting Started, author against the active family contracts, and open the security model when you need the full threat and visibility picture.
