The Reliability Stack

Agent reliability isn’t a single problem; it’s a lifecycle. ACME builds tools across that lifecycle: detecting risk before it becomes failure, protecting against failure at runtime, governing routing decisions, triaging incidents, supporting recovery, and giving operators long-horizon visibility.

The Lifecycle

DETECT ──▶ PROTECT ──▶ GOVERN ──▶ TRIAGE ──▶ RECOVER ──▶ OPERATE

Each phase has specific tooling. Most teams don’t need everything at once — the stack is designed to be adopted incrementally, starting with free detection tools.

Phase by Phase

Detect — Know your risk posture

RadCheck — Free, read-only scan

Before you can protect against reliability failures, you need to know where the risk is. RadCheck performs a fast, non-destructive scan and produces a 0–100 reliability score across four domains: stall risk, silence gaps, compaction pressure, and operational hygiene. Run it before deploying. Run it after incidents. Run it periodically as part of reliability hygiene.
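To make the idea of a single score across four domains concrete, here is a minimal sketch. It is not RadCheck's actual scoring method, which this page does not describe: the function name, the domain keys, and the equal weighting are all assumptions made for illustration.

```python
# Hypothetical illustration only. The domain keys, the function name, and the
# equal weighting are assumptions, not RadCheck's documented behavior.
DOMAINS = (
    "stall_risk",
    "silence_gaps",
    "compaction_pressure",
    "operational_hygiene",
)


def overall_score(domain_scores: dict) -> int:
    """Combine per-domain scores (each 0-100, higher is healthier) into one 0-100 score."""
    missing = [name for name in DOMAINS if name not in domain_scores]
    if missing:
        raise ValueError(f"missing domain scores: {missing}")
    for name in DOMAINS:
        if not 0 <= domain_scores[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    # Equal-weight average across the four domains (an assumption).
    return round(sum(domain_scores[name] for name in DOMAINS) / len(DOMAINS))
```

Under this assumed weighting, a weak result in one domain pulls the overall score down but cannot hide behind strong results elsewhere if you also review the per-domain numbers.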

Protect — Catch failures as they happen

Sentinel — Paid continuous monitoring

RadCheck is a point-in-time scan. Sentinel watches continuously, catching the silent failures and stalls that don’t crash but quietly stop your workflows. It runs as a lightweight observer alongside your agents: no interference, just detection.

Watchdog — Heartbeat and progress verification

The more precise sibling to Sentinel. Watchdog distinguishes between an agent that’s alive and an agent that’s progressing, a distinction that matters when agents get stuck rather than crash. Pair with the heartbeat SDK for maximum fidelity.
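The difference between "alive" and "progressing" can be sketched as two separate timestamps checked against two separate timeouts. This is an illustration of the concept only, not the Watchdog implementation or the heartbeat SDK API: `AgentStatus`, `classify`, and the timeout defaults are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AgentStatus:
    """Hypothetical record; field names are assumptions for illustration."""
    last_heartbeat: float  # time (seconds) the agent last reported it was alive
    last_progress: float   # time (seconds) the agent last completed a unit of work


def classify(status: AgentStatus, now: float,
             heartbeat_timeout: float = 30.0,
             progress_timeout: float = 300.0) -> str:
    """Return 'dead', 'stalled', or 'progressing' for an agent at time `now`."""
    if now - status.last_heartbeat > heartbeat_timeout:
        return "dead"       # no heartbeat: the process is likely gone
    if now - status.last_progress > progress_timeout:
        return "stalled"    # heartbeat is fresh, but no work is completing
    return "progressing"
```

A heartbeat-only check would report the "stalled" case as healthy, which is the gap the progress timestamp closes.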

Govern — Keep routing disciplined

SphinxGate — Paid policy enforcement

As routing complexity grows (multiple models, fallbacks, contexts), SphinxGate keeps it deterministic and auditable. Every routing decision is policy-driven, logged, and provable. Essential for teams with compliance requirements or multi-provider setups.
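One way to picture deterministic, auditable routing is an ordered rule list where the first matching rule wins and every decision is appended to a log. The sketch below is a generic illustration under that assumption; `Rule`, `Router`, and the model names are hypothetical and do not reflect SphinxGate's actual policy format or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    """Hypothetical policy rule: route a task type to a model."""
    name: str
    task_type: str  # "*" matches any task type
    model: str


@dataclass
class Router:
    rules: list          # evaluated in order; the first usable match wins
    default_model: str
    audit_log: list = field(default_factory=list)

    def route(self, request_id: str, task_type: str,
              unavailable: frozenset = frozenset()) -> str:
        """Pick a model deterministically and record why it was picked."""
        rule_name, model = "default", self.default_model
        for rule in self.rules:
            if rule.task_type in (task_type, "*") and rule.model not in unavailable:
                rule_name, model = rule.name, rule.model
                break
        # Every decision is logged with the rule that produced it.
        self.audit_log.append({
            "request_id": request_id,
            "task_type": task_type,
            "rule": rule_name,
            "model": model,
        })
        return model
```

Because the rules are evaluated in a fixed order with no randomness, the same inputs always yield the same model, and the log entry names the rule responsible.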

Triage — Understand what went wrong

OCTriage — Free incident assessment

When something breaks, OCTriage is your first tool. It collects context, classifies the failure pattern, and gives you a concrete next step, along with an evidence bundle for post-incident review or support escalation.

Recover — Verify you can get back up

Lazarus — Free recovery readiness

Most teams discover their backups don’t work during an actual incident. Lazarus validates recovery readiness proactively: backup coverage, restore assumption validation, and readiness scoring before failure forces the test.

Operate — Long-horizon visibility

DriftGuard — Long-horizon drift detection

The problems that don’t announce themselves. DriftGuard tracks behavioral baselines and flags slow erosion (context growth, latency trends, output consistency) before it crosses into active failure.

Agent911 — Paid unified control plane

The operator surface that aggregates everything. Health signals, anomaly detections, routing state, recovery readiness: all in one snapshot. When something goes wrong, Agent911 is where you go.

FindMyAgent — Fleet presence visibility (included with Agent911)

Live presence, heartbeat age, and “needs attention” flags for every agent in your fleet.
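The baseline tracking described for DriftGuard above can be illustrated by comparing a recent window of a metric against a baseline window. This is a deliberately simple sketch, not DriftGuard's detection method: the function names, the use of window means, and the 20% threshold are all assumptions.

```python
from statistics import mean


def drift_ratio(baseline: list, recent: list) -> float:
    """Relative change of the recent mean versus the baseline mean."""
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean must be non-zero")
    return (mean(recent) - base) / base


def has_drifted(baseline: list, recent: list, threshold: float = 0.2) -> bool:
    """Flag drift when the recent mean moves more than `threshold` from baseline."""
    return abs(drift_ratio(baseline, recent)) > threshold
```

For example, with latency samples in milliseconds, `has_drifted([100, 110, 90], [130, 125, 135])` compares a recent mean of 130 against a baseline mean of 100 and flags the change, even though no single sample looks like an outage.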

Infrastructure

Transmission — Event coherence layer

The reliability infrastructure for reliability signals. Transmission ensures events flow correctly between components: ordered, deduplicated, and delivered even under degraded conditions.
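Ordered, deduplicated delivery is commonly built on sequence numbers: hold events that arrive early, drop ones already seen, and release them in order. The sketch below shows that general technique under the assumption that each event carries a monotonically increasing sequence number. `OrderedDeduper` is a hypothetical name, and nothing here describes Transmission's internals.

```python
class OrderedDeduper:
    """Releases events in sequence order and drops duplicates (illustrative only)."""

    def __init__(self, first_seq: int = 1):
        self._next_seq = first_seq  # the next sequence number we may deliver
        self._pending = {}          # early arrivals, keyed by sequence number

    def accept(self, seq: int, payload) -> list:
        """Take one event; return the payloads now deliverable, in order."""
        if seq < self._next_seq or seq in self._pending:
            return []  # already delivered or already buffered: a duplicate
        self._pending[seq] = payload
        ready = []
        # Drain the buffer for as long as the next expected event is present.
        while self._next_seq in self._pending:
            ready.append(self._pending.pop(self._next_seq))
            self._next_seq += 1
        return ready
```

If event 2 arrives before event 1, nothing is delivered until event 1 shows up, at which point both are released in order; a second copy of event 2 is then ignored.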

How Tools Connect

┌─────────────────────────────────────────────────────┐
│                     AGENT911                        │
│            Unified Operator Surface                 │
└─────────────────┬───────────────┬───────────────────┘
                  │               │
        ┌─────────▼───┐   ┌───────▼───────┐
        │ TRANSMISSION│   │    LAZARUS    │
        │  (event bus)│   │  (readiness)  │
        └─────┬───────┘   └───────────────┘
              │
    ┌─────────┼──────────┬────────────┐
    ▼         ▼          ▼            ▼
SENTINEL  WATCHDOG  SPHINXGATE  DRIFTGUARD
(runtime) (progress) (routing)   (drift)
    │
    ▼
 RADCHECK   OCTRIAGE   LAZARUS
(scan-time) (triage)   (recovery)

Adoption Path

Most teams start small and add coverage as their agent fleet grows:

Solo operator / single agent

1. RadCheck — Get your first reliability score (free, takes 5 minutes)
2. OCTriage — Have it ready for when incidents happen (free)
3. Sentinel — Add when you want continuous protection

Small team / multiple agents

Add to the above:

4. Watchdog — Progress verification beyond what Sentinel observes
5. Agent911 — Unified view when you’re managing more than a couple of agents
6. Lazarus — Validate recovery readiness proactively

Multi-model / regulated environment

Add to the above:

7. SphinxGate — Deterministic routing with audit trail
8. DriftGuard — Long-horizon behavioral baseline tracking

Free vs. Paid

Product	Tier
RadCheck	Free
OCTriage	Free
Lazarus	Free
Sentinel	Paid
Agent911 + FindMyAgent	Paid
SphinxGate	Paid
Watchdog	Paid
DriftGuard	Paid
Transmission	Included with paid products

Free tools build trust. Paid tools solve the problem at scale. See Pricing for current rates.

Next Steps

5-Minute Quickstart

RadCheck
