PRODUCT · BENCH & AUDIT · Q3 2026 PREVIEW

Reproducible eval truth. Backdoor-aware audit. Both signed.

Stop trusting the eval numbers on a model card. Stop hoping the upstream maintainer didn't ship a sleeper agent. Run any frontier model through a frozen benchmark harness or a security audit pipeline; get a signed report you can put in front of an ATO board.

Join the preview → How it works

the problem

Eval numbers on model cards are anonymous claims. Backdoor scans don't exist for most models. You're flying blind.

When a vendor advertises 87.4% on MMLU, you have no way to reproduce it. Different prompt format, different sampling parameters, different harness version, and the number changes by 3-5 points. When a frontier model is published, it ships with no security review at all — no representation engineering scan, no backdoor probe, no contamination check. Procurement teams approve models on vibes. We're closing both gaps.

two services in one product line

Bench: reproducible eval. Audit: security review.

Both run on Apex GPU infrastructure. Both produce signed reports anchored to the rekor log. Both ship with the full execution trace so you can rerun any check yourself.

Bench: frozen harness

Standard suites — MMLU, HumanEval, GSM8K, BBH, ARC, TruthfulQA — plus domain-specific (MedQA, RoboArena, GAUSS-magnetometry). Frozen prompts, frozen sampling parameters, frozen seed. Identical inputs across every model. Different runs of the same model give identical outputs.

Bench: signed report

Every benchmark result paired with the harness commit hash, the sampling config, the prompt set hash, and the seed. Rekor index records when the eval ran and what version of every dependency was loaded. Reproducible by anyone with the same hash chain.

Bench: comparison view

Side-by-side numbers for any pair of models on any benchmark, with confidence intervals and failure analysis. The eval-report-as-API. Pulls into your model selection pipeline directly.

Audit: backdoor scanning

Activation-based detection of trigger patterns. Representation engineering probes for hidden behaviors. Targeted prompt sweeps drawn from the BackdoorLLM catalog. Findings ranked by severity with reproduction steps.

Audit: sleeper-agent probe

Adversarial prompt sweeps probing for sleeper-agent behaviors — model that misbehaves only on specific dates, specific contexts, or specific input patterns. Catches the Anthropic-style sleeper agents that survive standard fine-tuning.

Audit: weight provenance

Compare published weights to the chain of derivative claims on the model card. Detect undeclared base models, undisclosed fine-tuning data signatures, and inconsistencies between the architecture spec and the actual weight tensors.

"Most procurement teams approve models on vibes. We're handing them signed evidence."

— APEX ENGINEERING, ON THE BENCH & AUDIT PIPELINE

at the numbers

Designed to slot into procurement, not academic research.

benchmark suites at launch

General reasoning, code, math, safety, plus domain suites for medical and robotics.

audit categories

Backdoors, sleeper agents, contamination, prompt injection, jailbreak, weight integrity, license drift.

100%

reproducible runs

Frozen seed, frozen harness commit, frozen prompt hash. Anyone with the chain can rerun.

48 hr

standard turnaround

From submission to signed report. Rush turnaround available for procurement deadlines.

pricing

Per-eval and per-audit. Volume discounts for the catalog.

Pay only for the evals and audits you run. Subscription tier covers ongoing recertification as models are updated.

A la carte

$2K/ eval

Single benchmark suite, single model, signed report. Good for one-off model selection.

Choice of benchmark suite
Reproducible signed report
Rekor anchor for the run
48-hour turnaround

Inquire

Audit Bundle

$15K/ model

Full security audit: all 7 categories, comprehensive report, mitigation recommendations.

All 7 audit categories
Backdoor + sleeper-agent scan
Contamination + provenance checks
Prompt-injection & jailbreak suite
Mitigation recommendations
30-day re-audit at no charge

Talk to sales

Subscription

$120K/ yr

Unlimited evals + audits across the curated catalog. Includes recertification on every model update.

Unlimited evals on Top 100
10 audit bundles per year
Auto-recert on model updates
API access to eval database
Custom benchmark onboarding
Dedicated TAM

Q3 2026 preview. Lock in early-customer pricing now.

First 10 customers get charter pricing locked for two years and direct input into the audit category roadmap. Email us with the model you'd most like audited; we'll set up a private preview.

Request preview access → Other products