PRODUCT · BENCH & AUDIT · Q3 2026 PREVIEW

Reproducible eval truth. Backdoor-aware audit. Both signed.

Stop trusting the eval numbers on a model card. Stop hoping the upstream maintainer didn't ship a sleeper agent. Run any frontier model through a frozen benchmark harness or a security audit pipeline; get a signed report you can put in front of an ATO board.

the problem

Eval numbers on model cards are anonymous claims. Backdoor scans don't exist for most models. You're flying blind.

When a vendor advertises 87.4% on MMLU, you have no way to reproduce it. Different prompt format, different sampling parameters, different harness version, and the number changes by 3-5 points. When a frontier model is published, it ships with no security review at all — no representation engineering scan, no backdoor probe, no contamination check. Procurement teams approve models on vibes. We're closing both gaps.

two services in one product line

Bench: reproducible eval. Audit: security review.

Both run on Apex GPU infrastructure. Both produce signed reports anchored to the rekor log. Both ship with the full execution trace so you can rerun any check yourself.

Bench: frozen harness

Standard suites — MMLU, HumanEval, GSM8K, BBH, ARC, TruthfulQA — plus domain-specific (MedQA, RoboArena, GAUSS-magnetometry). Frozen prompts, frozen sampling parameters, frozen seed. Identical inputs across every model. Different runs of the same model give identical outputs.

Bench: signed report

Every benchmark result paired with the harness commit hash, the sampling config, the prompt set hash, and the seed. Rekor index records when the eval ran and what version of every dependency was loaded. Reproducible by anyone with the same hash chain.

Bench: comparison view

Side-by-side numbers for any pair of models on any benchmark, with confidence intervals and failure analysis. The eval-report-as-API. Pulls into your model selection pipeline directly.

Audit: backdoor scanning

Activation-based detection of trigger patterns. Representation engineering probes for hidden behaviors. Targeted prompt sweeps drawn from the BackdoorLLM catalog. Findings ranked by severity with reproduction steps.

Audit: sleeper-agent probe

Adversarial prompt sweeps probing for sleeper-agent behaviors — model that misbehaves only on specific dates, specific contexts, or specific input patterns. Catches the Anthropic-style sleeper agents that survive standard fine-tuning.

Audit: weight provenance

Compare published weights to the chain of derivative claims on the model card. Detect undeclared base models, undisclosed fine-tuning data signatures, and inconsistencies between the architecture spec and the actual weight tensors.

"Most procurement teams approve models on vibes. We're handing them signed evidence."
— APEX ENGINEERING, ON THE BENCH & AUDIT PIPELINE

at the numbers

Designed to slot into procurement, not academic research.

14
benchmark suites at launch
General reasoning, code, math, safety, plus domain suites for medical and robotics.
7
audit categories
Backdoors, sleeper agents, contamination, prompt injection, jailbreak, weight integrity, license drift.
100%
reproducible runs
Frozen seed, frozen harness commit, frozen prompt hash. Anyone with the chain can rerun.
48 hr
standard turnaround
From submission to signed report. Rush turnaround available for procurement deadlines.

pricing

Per-eval and per-audit. Volume discounts for the catalog.

Pay only for the evals and audits you run. Subscription tier covers ongoing recertification as models are updated.

A la carte
$2K/ eval

Single benchmark suite, single model, signed report. Good for one-off model selection.

  • Choice of benchmark suite
  • Reproducible signed report
  • Rekor anchor for the run
  • 48-hour turnaround
Inquire
Subscription
$120K/ yr

Unlimited evals + audits across the curated catalog. Includes recertification on every model update.

  • Unlimited evals on Top 100
  • 10 audit bundles per year
  • Auto-recert on model updates
  • API access to eval database
  • Custom benchmark onboarding
  • Dedicated TAM
Contact us

Q3 2026 preview. Lock in early-customer pricing now.

First 10 customers get charter pricing locked for two years and direct input into the audit category roadmap. Email us with the model you'd most like audited; we'll set up a private preview.