The complete framework for secure benchmark authoring and enforcement
SecureBench pairs published standards for benchmark packs with a runtime that enforces visibility lanes, sandboxed harnesses, and trusted verification. Authors get a clear path to a first pack; reviewers can trace how structure becomes behavior at run time.
Why benchmark security needs shared defaults
Teams need both guidance and enforcement: explicit contracts for repo patches and terminal workspaces, and a runtime that executes those defaults instead of trusting a single-process runner.
- No shared safe defaults: Authors invent pack layout, visibility rules, and harness wiring from scratch, with no common baseline for what “safe” looks like.
- Family rules differ: Repository patches and terminal workspaces need explicit contracts for what agents may change and what verifiers may trust.
- Standards stay on paper: Even when authors document intent, monolithic runners do not enforce separation between public task data, candidate production, and verification.
- Execution inherits risk: Unrestricted shells, shared filesystems, and co-located secrets turn a poorly structured pack into an unauditable run.
For benchmark creators: author repo patch or terminal task packs; the compiler turns manifest, row, and asset structure into enforced visibility lanes at run time.
For security reviewers: inspect the active family contracts, then trace how the framework materializes resources, isolates candidate production, and runs verification from trusted code paths.
Contracts for agentic benchmark packs
Authors design against repo patch and terminal task contracts (visibility maps, asset roots, harness policy) instead of reinventing security rules for every benchmark.
- Agentic contracts: The active repo patch and terminal task contracts define row schemas, eval visibility, and authoring security notes.
- Pack structure: Manifest, JSONL, and asset roots encode public, evaluation, and hidden material with defaults that steer authors away from unsafe layouts.
- Tester YAML standard: Harness selection, verification policy, and egress rules live in a shared schema, not a one-off per benchmark.
Canonical standards for tester YAML and security policy ship in the SecureBench repo. The Benchmark Packs and Tester Configuration guides on this site expand on them with architecture context.
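To make the idea of a shared tester schema concrete, here is a minimal Python sketch of the kind of checks such a schema could support. Every field name (`harness`, `verifier_in_fresh_sandbox`, `egress_allowlist`), the `TesterConfig` type, and the wildcard rule are assumptions made for illustration; only the family names `repo_patch` and `terminal_task` come from this page. The canonical schema is the one in the SecureBench repo.

```python
from dataclasses import dataclass

# Family names as listed on this page.
ACTIVE_FAMILIES = {"repo_patch", "terminal_task"}


@dataclass(frozen=True)
class TesterConfig:
    """Hypothetical in-memory view of a tester configuration."""
    family: str
    harness: str                              # hypothetical harness identifier
    verifier_in_fresh_sandbox: bool = True    # hypothetical policy flag
    egress_allowlist: tuple[str, ...] = ()    # empty tuple means no network


def validate(config: TesterConfig) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    if config.family not in ACTIVE_FAMILIES:
        problems.append(f"unknown family: {config.family}")
    if not config.harness:
        problems.append("harness must be set")
    for host in config.egress_allowlist:
        # Assumed rule for this sketch: allowlist entries must be exact hosts.
        if "*" in host:
            problems.append(f"wildcard egress not allowed: {host}")
    return problems
```

Keeping these checks in one shared place, rather than in each benchmark's own runner, is the design choice the bullet on the tester YAML standard describes.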
Documentation
Three guides cover the first run, pack authoring, and security review. Use the rest for architecture, reference, and extension.
candidates.jsonl). Active families are repo_patch and terminal_task.
Runtime enforcement at compile time and on disk
The framework compiles contract-shaped packs into SecureBenchTask objects, runs candidate production in sandboxed harnesses, and verifies from trusted paths. Visibility lanes are enforced when tasks compile and when resources materialize, not only described in docs.
| Typical approach | SecureBench |
|---|---|
| Each benchmark invents its own layout and security posture | Agentic family contracts define safe pack, row, and tester defaults |
| Runners execute tasks however authors wired them | Compiler maps eval fields to visibility lanes per family contract |
| One process reads prompts and scores answers | Separate harness (producer) and verifier phases |
| Shell and network access are often unrestricted | Docker sandboxes default to no network; optional allowlisted egress |
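The last table row can be illustrated with a small Python helper that builds a `docker run` command line with networking disabled by default. The `--network none` flag is standard Docker; the helper name `docker_argv` and its parameters are hypothetical and are not SecureBench's actual sandbox launcher. Allowlisted egress needs additional filtering that this sketch does not model.

```python
def docker_argv(image: str, command: list[str], network: str = "none") -> list[str]:
    """Build an argv list for `docker run` with no network unless told otherwise.

    `network` is passed to Docker's --network flag. The default "none" gives
    the container only a loopback interface.
    """
    return ["docker", "run", "--rm", "--network", network, image, *command]
```

For example, `docker_argv("python:3.12-slim", ["python", "-V"])` produces a command that starts the container without network access. Returning a list instead of a shell string keeps arguments from being reinterpreted by a shell.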
- public: Prompts, instructions, and assets the harness may expose to candidate production.
- evaluation_inputs: Tests and checkers staged for verification; never included in agent payloads.
- hidden: Gold patches, expected state, and trusted analysis data reserved for verifier code paths.
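The three lanes above can be pictured as a partition of each row's fields at compile time. The Python sketch below is a simplified stand-in, not the real compiler: `compile_row`, the shape of the visibility map, and the fail-closed handling of unmapped fields are all assumptions, and the result is a plain dict where the framework would produce a SecureBenchTask.

```python
LANES = ("public", "evaluation_inputs", "hidden")  # lane names from this page


def compile_row(row: dict, visibility: dict) -> dict:
    """Partition a row's fields into lanes according to a visibility map.

    Assumed policy for this sketch: a field with no mapping, or with an
    unknown lane, is rejected instead of defaulting to public.
    """
    lanes = {lane: {} for lane in LANES}
    for key, value in row.items():
        lane = visibility.get(key)
        if lane not in lanes:
            raise ValueError(f"field {key!r} has no valid visibility lane: {lane!r}")
        lanes[lane][key] = value
    return lanes
```

Rejecting unmapped fields means a row that gains a new column cannot leak it to the agent by accident; the author has to assign a lane first.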
Agent phase
- Public payload only
- Harness sandbox with path policy
- Candidate extraction, no verifier scripts
Verifier phase
- Hidden and evaluation inputs available
- Fresh verifier sandbox per check
- Trusted checker code writes the score
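The split between the two phases can be sketched in a few lines of Python. This is an illustration of the data flow only: `run_task`, `produce`, and `check` are hypothetical names, and plain function calls provide no isolation, whereas the framework runs each phase in its own sandbox.

```python
from typing import Any, Callable


def run_task(
    lanes: dict,
    produce: Callable[[dict], Any],
    check: Callable[[Any, dict, dict], float],
) -> float:
    """Run the agent phase, then the verifier phase, and return the score."""
    # Agent phase: the producer receives a copy of the public lane only.
    candidate = produce(dict(lanes["public"]))
    # Verifier phase: trusted checker code sees the candidate plus the
    # evaluation and hidden lanes, and is the only code that returns a score.
    return check(candidate, lanes["evaluation_inputs"], lanes["hidden"])
```

Because the producer never receives the evaluation or hidden lanes as arguments, it cannot read them through this interface; real enforcement still depends on the sandbox and path policy listed above.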
Run a smoke benchmark, then deepen the standards
Start with Getting Started, author against the active family contracts, and open the security model when you need the full threat and visibility picture.