Platform Overview

ACME is a reliability stack for AI agent operators. The products aren’t a bundle of unrelated tools — they form a coherent system designed around how agent failures actually happen and how operators actually respond to them.

The Reliability Loop

Agent reliability follows a loop. ACME tools map to each phase:
┌──────────────────────────────────────────────────────────────┐
│                                                              │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────┐ │
│   │  DETECT  │───▶│  TRIAGE  │───▶│ RECOVER  │───▶│VERIFY │ │
│   └──────────┘    └──────────┘    └──────────┘    └───────┘ │
│        │               │               │               │    │
│   RadCheck         OCTriage        Agent911         Lazarus  │
│   Sentinel         DriftGuard      Watchdog                  │
│   Watchdog         SphinxGate                               │
│   FindMyAgent                                               │
│                                                              │
└──────────────────────────────────────────────────────────────┘
                    ▲                               │
                    └───────────────────────────────┘
                         Transmission (signal bus)
Every tool has a defined place in this loop. Where a tool appears in more than one phase (Watchdog, DriftGuard), it plays a distinct role in each, so no two tools share a responsibility. Each handoff is explicit.
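
As a rough illustration, the phase-to-tool mapping in the diagram can be written down as plain data. This is only a sketch: `RELIABILITY_LOOP` and `phases_for` are hypothetical names, not part of any ACME API. Only the tool and phase names come from the diagram.

```python
# Hypothetical sketch: the Reliability Loop diagram as plain data.
# Nothing here is an ACME API; only the tool and phase names come from the page.
RELIABILITY_LOOP = {
    "detect": ["RadCheck", "Sentinel", "Watchdog", "FindMyAgent"],
    "triage": ["OCTriage", "DriftGuard", "SphinxGate"],
    "recover": ["Agent911", "Watchdog"],
    "verify": ["Lazarus"],
}


def phases_for(tool: str) -> list[str]:
    """Return every phase in which a tool appears, in loop order."""
    return [phase for phase, tools in RELIABILITY_LOOP.items() if tool in tools]
```

Calling `phases_for("Watchdog")` returns both `detect` and `recover`, which matches Watchdog's two positions in the diagram.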

Layer by Layer

Detect

Tools that identify problems — before users do.
Tool          What It Detects
RadCheck      Point-in-time reliability risk (0–100 score)
Sentinel      Real-time stalls, silence gaps, runtime anomalies
Watchdog      Missed heartbeats, liveness failures, throughput collapse
DriftGuard    Long-horizon behavioral drift across sessions
FindMyAgent   Fleet presence: who’s up, who’s stalled, who needs attention
SphinxGate    Unauthorized routing, policy violations, audit anomalies
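
To make "missed heartbeats" concrete, here is a minimal sketch of a liveness check. It assumes each agent reports a heartbeat and counts as stalled once the gap exceeds a timeout. `HeartbeatMonitor`, `timeout_s`, and the agent IDs are hypothetical; this is not how Watchdog is actually implemented.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HeartbeatMonitor:
    """Hypothetical liveness tracker (an illustration, not Watchdog's code)."""
    timeout_s: float  # assumed: longest allowed gap between heartbeats
    last_seen: dict[str, float] = field(default_factory=dict)

    def beat(self, agent_id: str, now: Optional[float] = None) -> None:
        """Record a heartbeat; `now` can be injected for testing."""
        self.last_seen[agent_id] = time.monotonic() if now is None else now

    def stalled(self, now: Optional[float] = None) -> list[str]:
        """Return agents whose last heartbeat is older than the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(
            agent for agent, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

The monitor only reports which agents look stalled. Deciding what to do about them is left to the operator.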

Triage

Tools that help you understand what happened and what to do.
Tool          What It Classifies
OCTriage      Incident type, root cause, evidence bundle, next steps
DriftGuard    Drift patterns and memory integrity issues

Recover

Tools that coordinate and guide recovery.
Tool          What It Provides
Agent911      Unified control plane: telemetry, playbooks, proof bundles
Watchdog      Escalation and handoff to recovery workflow

Verify

Tools that confirm recovery was actually successful.
Tool          What It Verifies
Lazarus       Recovery readiness: can your system actually restore?

Signal Bus

Tool          Role
Transmission  Moves reliability signals between all components with delivery guarantees
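
The page does not say which delivery guarantee Transmission offers. Purely as an illustration, the sketch below assumes at-least-once delivery: a signal stays pending until the consumer acknowledges it, and anything that exhausts its retries is parked rather than dropped. `AtLeastOnceBus`, `max_attempts`, and the signal shape are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AtLeastOnceBus:
    """Hypothetical in-memory bus; a signal stays pending until acknowledged."""
    max_attempts: int = 3
    pending: dict[int, dict] = field(default_factory=dict)
    dead_letter: list[dict] = field(default_factory=list)
    _next_id: int = 0

    def publish(self, signal: dict) -> int:
        """Queue a signal and return its ID."""
        self._next_id += 1
        self.pending[self._next_id] = signal
        return self._next_id

    def deliver(self, handler: Callable[[dict], bool]) -> None:
        """Offer each pending signal to the handler; True means acknowledged."""
        for signal_id, signal in list(self.pending.items()):
            for _ in range(self.max_attempts):
                if handler(signal):
                    del self.pending[signal_id]
                    break
            else:
                # Retries exhausted: park the signal for inspection, never drop it.
                self.dead_letter.append(self.pending.pop(signal_id))
```

Because a handler may see the same signal more than once under this assumption, consumers would need to be idempotent.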

Common Deployment Patterns

Starting Out: Free Baseline

For teams just getting started with agent reliability.
RadCheck → OCTriage → Lazarus
  • Run RadCheck to understand your current state
  • Use OCTriage when incidents happen
  • Run Lazarus to verify you can recover
Cost: Free.

Production Protection: Core Stack

For teams running production agents that need runtime protection.
Sentinel + Watchdog → Agent911 + FindMyAgent
       ↓                      ↓
  Detection layer      Response layer
  • Sentinel catches runtime anomalies continuously
  • Watchdog verifies liveness and escalates
  • Agent911 provides the response surface with guided playbooks
  • FindMyAgent gives fleet-wide visibility
Best for: Teams with 2+ production agents, recurring incidents, or 24/7 uptime requirements.

Multi-Model Governance: Add SphinxGate

For teams using multiple models or with compliance requirements.
Core Stack + SphinxGate + Transmission
  • SphinxGate enforces routing policy and maintains an audit trail
  • Transmission ensures all signals reach Agent911 reliably
Best for: Teams using fallback chains, multiple providers, or regulated AI usage.
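
As a sketch of what "enforce routing policy and keep an audit trail" could look like for a fallback chain, the example below assumes a simple allow-list of providers and records every decision. `RoutingPolicy`, `route`, and the provider names are hypothetical and do not describe SphinxGate's real interface.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingPolicy:
    """Hypothetical allow-list policy with an append-only audit trail."""
    allowed: frozenset[str]
    audit: list[dict] = field(default_factory=list)

    def route(self, fallback_chain: list[str]) -> str:
        """Pick the first permitted provider, recording every decision."""
        for provider in fallback_chain:
            permitted = provider in self.allowed
            self.audit.append({"provider": provider, "allowed": permitted})
            if permitted:
                return provider
        raise PermissionError("no provider in the fallback chain is permitted")
```

Rejected providers are logged as well as accepted ones, so the trail shows why a fallback was taken.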

Full Stack: Operator Bundle

All tools. Full coverage across detection, triage, routing, recovery, and verification. See the Operator Bundle for details and pricing.

Design Principles

Observe, don’t interfere

ACME tools watch your systems. They do not autonomously modify agent behavior, restart processes, or take recovery actions without operator direction. Every action that changes system state is operator-initiated. This is intentional. Autonomy without understanding creates more incidents, not fewer.
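
One way to picture "every state-changing action is operator-initiated" is an approval gate: a tool may propose an action, but nothing runs without a named operator. The sketch below is hypothetical; `ProposedAction` and `execute` are illustrative names, not ACME functions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProposedAction:
    """A state-changing step a tool recommends but never runs on its own."""
    description: str
    run: Callable[[], str]


def execute(action: ProposedAction, approved_by: Optional[str]) -> str:
    """Run the action only when a named operator has approved it."""
    if not approved_by:
        raise PermissionError(f"'{action.description}' requires operator approval")
    return f"{approved_by}: {action.run()}"
```

With this shape, a tool can attach a `ProposedAction` to an alert, but the call to `execute` always originates from a person.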

Evidence-first

Every detection includes evidence. Every incident includes a proof bundle. Every recovery includes a readiness check. The goal is that operators always know why a tool is telling them something, not just what.
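
A small sketch of the evidence-first idea: a detection record that cannot be constructed without at least one piece of evidence. The `Detection` type and its `what` and `why` fields are hypothetical, chosen to mirror the "why, not just what" wording above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical detection record that refuses to exist without evidence."""
    what: str              # the finding, e.g. "stall"
    why: tuple[str, ...]   # the evidence that supports it

    def __post_init__(self) -> None:
        if not self.why:
            raise ValueError("a detection must carry at least one piece of evidence")
```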

Deterministic playbooks

Recovery shouldn’t depend on which team member is on-call. Agent911 playbooks give the same guidance to everyone, regardless of familiarity with the system. Same incident type → same playbook → same recovery path.
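
The "same incident type, same playbook" rule amounts to a pure lookup with no per-person branching. The sketch below shows that shape. The incident types and steps are invented for illustration and are not actual Agent911 playbooks.

```python
# Hypothetical playbook table; the incident types and steps are placeholders.
PLAYBOOKS: dict[str, tuple[str, ...]] = {
    "stall": (
        "Confirm the stall from the evidence bundle",
        "Check the last heartbeat",
        "Restart the agent manually",
    ),
    "drift": (
        "Compare recent sessions against the baseline",
        "Review memory integrity findings",
    ),
}


def playbook_for(incident_type: str) -> tuple[str, ...]:
    """Same incident type in, same steps out, whoever is on call."""
    # Unknown types raise KeyError rather than improvising a recovery path.
    return PLAYBOOKS[incident_type]
```

Storing the steps as immutable tuples keeps the guidance identical between calls.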

Trust-first free tier

RadCheck, OCTriage, and Lazarus are free. These tools build the foundation of trust — you understand your system before you buy anything. Paid tools solve the problem at scale.

What ACME Is Not

Understanding what we don’t do is as important as understanding what we do.
ACME is not:
  • An orchestration framework (we watch your agents; we don’t direct them)
  • An autonomous healing system (operators always act)
  • A replacement for logging/APM (we’re the reliability layer on top)
  • Tied to one agent framework (we work with OpenClaw, LangChain, AutoGPT, and custom systems)

Getting Started