PRODUCT · BENCH & AUDIT · Q3 2026 PREVIEW
Stop trusting the eval numbers on a model card. Stop hoping the upstream maintainer didn't ship a sleeper agent. Run any frontier model through a frozen benchmark harness or a security audit pipeline; get a signed report you can put in front of an ATO board.
the problem
When a vendor advertises 87.4% on MMLU, you have no way to reproduce it. Different prompt format, different sampling parameters, different harness version, and the number changes by 3-5 points. When a frontier model is published, it ships with no security review at all — no representation engineering scan, no backdoor probe, no contamination check. Procurement teams approve models on vibes. We're closing both gaps.
two services in one product line
Both run on Apex GPU infrastructure. Both produce signed reports anchored to the rekor log. Both ship with the full execution trace so you can rerun any check yourself.
Standard suites — MMLU, HumanEval, GSM8K, BBH, ARC, TruthfulQA — plus domain-specific (MedQA, RoboArena, GAUSS-magnetometry). Frozen prompts, frozen sampling parameters, frozen seed. Identical inputs across every model. Different runs of the same model give identical outputs.
Every benchmark result paired with the harness commit hash, the sampling config, the prompt set hash, and the seed. Rekor index records when the eval ran and what version of every dependency was loaded. Reproducible by anyone with the same hash chain.
Side-by-side numbers for any pair of models on any benchmark, with confidence intervals and failure analysis. The eval-report-as-API. Pulls into your model selection pipeline directly.
Activation-based detection of trigger patterns. Representation engineering probes for hidden behaviors. Targeted prompt sweeps drawn from the BackdoorLLM catalog. Findings ranked by severity with reproduction steps.
Adversarial prompt sweeps probing for sleeper-agent behaviors — model that misbehaves only on specific dates, specific contexts, or specific input patterns. Catches the Anthropic-style sleeper agents that survive standard fine-tuning.
Compare published weights to the chain of derivative claims on the model card. Detect undeclared base models, undisclosed fine-tuning data signatures, and inconsistencies between the architecture spec and the actual weight tensors.
"Most procurement teams approve models on vibes. We're handing them signed evidence."— APEX ENGINEERING, ON THE BENCH & AUDIT PIPELINE
at the numbers
pricing
Pay only for the evals and audits you run. Subscription tier covers ongoing recertification as models are updated.
Single benchmark suite, single model, signed report. Good for one-off model selection.
Full security audit: all 7 categories, comprehensive report, mitigation recommendations.
Unlimited evals + audits across the curated catalog. Includes recertification on every model update.
First 10 customers get charter pricing locked for two years and direct input into the audit category roadmap. Email us with the model you'd most like audited; we'll set up a private preview.