The Reliability Stack

Agent reliability isn’t a single problem — it’s a lifecycle. ACME builds tools across that lifecycle: detecting risk before it becomes failure, protecting against failure at runtime, governing routing decisions, and supporting recovery when incidents occur.

The Lifecycle

DETECT ──▶ PROTECT ──▶ GOVERN ──▶ TRIAGE ──▶ RECOVER ──▶ OPERATE

Each phase has specific tooling. Most teams don’t need everything at once — the stack is designed to be adopted incrementally, starting with free detection tools.

Phase by Phase

Detect — Know your risk posture

RadCheck — Free, read-only scan

Before you can protect against reliability failures, you need to know where the risk is. RadCheck performs a fast, non-destructive scan and produces a 0–100 reliability score across four domains: stall risk, silence gaps, compaction pressure, and operational hygiene. Run it before deploying. Run it after incidents. Run it periodically as part of reliability hygiene.
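As a rough illustration of how four domain scores could roll up into a single 0–100 number, here is a minimal sketch. It is not RadCheck's actual scoring method: the `DomainScores` type, the `overall_score` function, the "higher is healthier" scale, and the equal weighting are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainScores:
    """Hypothetical per-domain scores, each 0-100 (higher is healthier)."""
    stall_risk: float
    silence_gaps: float
    compaction_pressure: float
    operational_hygiene: float


def overall_score(scores: DomainScores) -> int:
    """Combine the four domain scores into one 0-100 score.

    Equal weighting is an assumption of this sketch, not a documented rule.
    """
    values = [
        scores.stall_risk,
        scores.silence_gaps,
        scores.compaction_pressure,
        scores.operational_hygiene,
    ]
    for value in values:
        if not 0 <= value <= 100:
            raise ValueError(f"domain score out of range: {value}")
    return round(sum(values) / len(values))


if __name__ == "__main__":
    print(overall_score(DomainScores(80, 60, 100, 40)))
```

A real scanner may well weight the domains differently or cap the overall score when any single domain is critically low; the point here is only that the headline number is a summary of the four domains.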

Protect — Catch failures as they happen

Sentinel — Paid continuous monitoring

RadCheck is a point-in-time scan. Sentinel watches continuously, catching the silent failures and stalls that don’t crash but quietly stop your workflows. It runs as a lightweight observer alongside your agents — no interference, just detection.

Watchdog — Heartbeat and progress verification

The more precise sibling to Sentinel. Watchdog distinguishes between an agent that’s alive and an agent that’s progressing — a distinction that matters when agents get stuck rather than crash. Pair with the heartbeat SDK for maximum fidelity.
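The difference between "alive" and "progressing" can be made concrete by tracking two timestamps per agent instead of one. The sketch below is illustrative only and does not use the heartbeat SDK: `AgentStatus`, `AgentState`, `classify`, and both timeout defaults are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum


class AgentState(Enum):
    PROGRESSING = "progressing"  # heartbeat and progress are both recent
    STALLED = "stalled"          # still sending heartbeats, but no new work done
    DEAD = "dead"                # no recent heartbeat at all


@dataclass(frozen=True)
class AgentStatus:
    last_heartbeat: float  # time of the last heartbeat, in seconds
    last_progress: float   # time the agent last completed a unit of work


def classify(
    status: AgentStatus,
    now: float,
    heartbeat_timeout: float = 30.0,   # assumed value for this sketch
    progress_timeout: float = 300.0,   # assumed value for this sketch
) -> AgentState:
    """Classify an agent from its heartbeat age and its progress age."""
    if now - status.last_heartbeat > heartbeat_timeout:
        return AgentState.DEAD
    if now - status.last_progress > progress_timeout:
        return AgentState.STALLED
    return AgentState.PROGRESSING


if __name__ == "__main__":
    # Heartbeat 10 s ago, last progress 400 s ago: alive but stuck.
    print(classify(AgentStatus(last_heartbeat=990, last_progress=600), now=1000))
```

A liveness-only check would report the agent in the usage example as healthy, which is exactly the stuck-but-not-crashed case described above.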

Govern — Keep routing disciplined

SphinxGate — Paid policy enforcement

As routing complexity grows — multiple models, fallbacks, contexts — SphinxGate keeps it deterministic and auditable. Every routing decision is policy-driven, logged, and provable. Essential for teams with compliance requirements or multi-provider setups.
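To show what "policy-driven and logged" routing can look like in practice, here is a small sketch of a first-match rule table that records every decision. It is not SphinxGate's API or policy format: `Rule`, `Router`, the `task_type` attribute, and the model names are hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    name: str
    task_type: str  # hypothetical request attribute the rule matches on
    model: str      # model selected when the rule matches


@dataclass
class Router:
    rules: list[Rule]
    default_model: str
    audit_log: list[dict] = field(default_factory=list)

    def route(self, request_id: str, task_type: str) -> str:
        """Pick a model by policy and record the decision."""
        # First matching rule wins, so identical inputs always route identically.
        matched = next((r for r in self.rules if r.task_type == task_type), None)
        model = matched.model if matched else self.default_model
        self.audit_log.append(
            {
                "request_id": request_id,
                "task_type": task_type,
                "rule": matched.name if matched else "default",
                "model": model,
            }
        )
        return model


if __name__ == "__main__":
    router = Router(
        rules=[Rule(name="code-to-a", task_type="code", model="model-a")],
        default_model="model-b",
    )
    print(router.route("req-1", "code"))
    print(router.route("req-2", "chat"))
    print(router.audit_log)
```

Because the decision depends only on the rule list and the request attributes, the audit log is enough to replay and explain any routing choice after the fact.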

Triage — Understand what went wrong

OCTriage — Free incident assessment

When something breaks, OCTriage is your first tool. It collects context, classifies the failure pattern, and gives you a concrete next step — along with an evidence bundle for post-incident review or support escalation.

Recover — Verify you can get back up

Lazarus — Free recovery readiness

Most teams discover their backups don’t work during an actual incident. Lazarus validates recovery readiness proactively: backup coverage, restore assumption validation, and readiness scoring before failure forces the test.

Operate — Long-horizon visibility

DriftGuard — Long-horizon drift detection

The problems that don’t announce themselves. DriftGuard tracks behavioral baselines and flags slow erosion — context growth, latency trends, output consistency — before they cross into active failure.

Agent911 — Paid unified control plane

The operator surface that aggregates everything. Health signals, anomaly detections, routing state, recovery readiness — all in one snapshot. When something goes wrong, Agent911 is where you go.

FindMyAgent — Fleet presence visibility (included with Agent911)

Live presence, heartbeat age, and “needs attention” flags for every agent in your fleet.
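The baseline idea behind drift detection can be illustrated with a simple statistical check: record a metric while the system is healthy, then flag when recent values move too far from that baseline. This is a generic sketch, not DriftGuard's algorithm; the `drift_detected` function, the latency figures, and the three-standard-deviation threshold are assumptions for illustration.

```python
from statistics import mean, stdev


def drift_detected(
    baseline: list[float],
    recent: list[float],
    threshold: float = 3.0,  # assumed cutoff, in baseline standard deviations
) -> bool:
    """Return True when the recent mean has moved away from the baseline mean
    by more than `threshold` baseline standard deviations."""
    if len(baseline) < 2 or not recent:
        raise ValueError("need at least two baseline samples and one recent sample")
    base_mean = mean(baseline)
    base_std = stdev(baseline)
    if base_std == 0:
        # A perfectly flat baseline: any change at all counts as drift.
        return mean(recent) != base_mean
    return abs(mean(recent) - base_mean) / base_std > threshold


if __name__ == "__main__":
    baseline_latency_ms = [100, 102, 98, 101, 99]  # hypothetical healthy samples
    print(drift_detected(baseline_latency_ms, [101, 100, 99]))
    print(drift_detected(baseline_latency_ms, [110, 112, 111]))
```

The same comparison could be applied to context size or an output-consistency metric; slow erosion shows up as a recent window that gradually separates from the baseline even though no single sample looks like a failure.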

Infrastructure

Transmission — Event coherence layer

The reliability infrastructure for reliability signals. Transmission ensures events flow correctly between components — ordered, deduplicated, and delivered even under degraded conditions.
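One common way to get ordered, deduplicated delivery is to number events at the source and have the receiver buffer anything that arrives early. The sketch below shows that technique in isolation; it is not Transmission's implementation, and the `OrderedDeduplicator` class and its per-source sequence numbers are assumptions for this example.

```python
from dataclasses import dataclass, field


@dataclass
class OrderedDeduplicator:
    """Releases events in sequence order and drops repeats (hypothetical helper)."""
    next_seq: int = 0
    pending: dict[int, str] = field(default_factory=dict)

    def receive(self, seq: int, payload: str) -> list[str]:
        """Accept one event and return the payloads that are now deliverable."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate: already delivered or already buffered
        self.pending[seq] = payload
        ready = []
        # Drain the buffer for as long as the next expected event is present.
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


if __name__ == "__main__":
    receiver = OrderedDeduplicator()
    print(receiver.receive(1, "b"))  # arrives early, held back
    print(receiver.receive(0, "a"))  # fills the gap, releases both in order
    print(receiver.receive(1, "b"))  # retransmission, dropped
```

This covers ordering and deduplication only. Delivery under degraded conditions would additionally need retries and durable storage on the sending side, which are outside this sketch.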

How Tools Connect

┌─────────────────────────────────────────────────────┐
│                     AGENT911                         │
│            Unified Operator Surface                  │
└─────────────────┬───────────────┬───────────────────┘
                  │               │
        ┌─────────▼──┐    ┌───────▼───────┐
        │ TRANSMISSION│    │   LAZARUS     │
        │  (event bus)│    │ (readiness)   │
        └─────┬───────┘    └───────────────┘

    ┌─────────┼──────────┬────────────┐
    ▼         ▼          ▼            ▼
SENTINEL  WATCHDOG  SPHINXGATE  DRIFTGUARD
(runtime) (progress) (routing)   (drift)


 RADCHECK   OCTRIAGE   LAZARUS
(scan-time) (triage)   (recovery)

Adoption Path

Most teams start small and add coverage as their agent fleet grows:

Solo operator / single agent

  1. RadCheck — Get your first reliability score (free, takes 5 minutes)
  2. OCTriage — Have it ready for when incidents happen (free)
  3. Sentinel — Add when you want continuous protection

Small team / multiple agents

Add to the above:

  4. Watchdog — Progress verification beyond what Sentinel observes
  5. Agent911 — Unified view when you’re managing more than a couple of agents
  6. Lazarus — Validate recovery readiness proactively

Multi-model / regulated environment

Add to the above:

  7. SphinxGate — Deterministic routing with audit trail
  8. DriftGuard — Long-horizon behavioral baseline tracking

Free vs. Paid

Product                   Tier
RadCheck                  Free
OCTriage                  Free
Lazarus                   Free
Sentinel                  Paid
Agent911 + FindMyAgent    Paid
SphinxGate                Paid
Watchdog                  Paid
DriftGuard                Paid
Transmission              Included with paid products

Free tools build trust. Paid tools solve the problem at scale. See Pricing for current rates.

Next Steps