Platform Overview

ACME is a reliability stack for AI agent operators. The products aren’t a bundle of unrelated tools; they form a coherent system designed around how agent failures happen and how operators respond to them.

The Reliability Loop

Agent reliability follows a loop. ACME tools map to each phase:

┌──────────────────────────────────────────────────────────────┐
│                                                              │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────┐ │
│   │  DETECT  │───▶│  TRIAGE  │───▶│ RECOVER  │───▶│VERIFY │ │
│   └──────────┘    └──────────┘    └──────────┘    └───────┘ │
│        │               │               │               │    │
│   RadCheck         OCTriage        Agent911         Lazarus  │
│   Sentinel         DriftGuard      Watchdog                  │
│   Watchdog         SphinxGate                               │
│   FindMyAgent                                               │
│                                                              │
└──────────────────────────────────────────────────────────────┘
                    ▲                               │
                    └───────────────────────────────┘
                         Transmission (signal bus)

Every tool has a defined place in this loop. A tool that appears in more than one phase, such as Watchdog or DriftGuard, has a distinct responsibility in each, so no two tools overlap. Each handoff is explicit.
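
As a rough illustration, the loop can be written down as a mapping from phase to tools. This is a sketch only, not ACME configuration or an ACME API: the `Phase` enum, the `LOOP` dictionary, and both helper functions are hypothetical names. The tool placement is copied from the diagram above, and treating the loop as cyclic (Verify feeding back into Detect) is an assumption based on the return arrow.

```python
from enum import Enum


class Phase(Enum):
    DETECT = "detect"
    TRIAGE = "triage"
    RECOVER = "recover"
    VERIFY = "verify"


# Tool placement copied from the loop diagram; the structure is illustrative.
LOOP = {
    Phase.DETECT: ["RadCheck", "Sentinel", "Watchdog", "FindMyAgent"],
    Phase.TRIAGE: ["OCTriage", "DriftGuard", "SphinxGate"],
    Phase.RECOVER: ["Agent911", "Watchdog"],
    Phase.VERIFY: ["Lazarus"],
}


def next_phase(phase: Phase) -> Phase:
    """Return the phase that follows `phase`; VERIFY wraps back to DETECT."""
    order = list(Phase)
    return order[(order.index(phase) + 1) % len(order)]


def phases_for(tool: str) -> list[Phase]:
    """List every phase in which a tool appears in the diagram."""
    return [phase for phase, tools in LOOP.items() if tool in tools]
```

Calling `phases_for("Watchdog")` returns both the Detect and Recover phases, which matches the two rows Watchdog occupies in the diagram.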

Layer by Layer

Detect

Tools that identify problems — before users do.

Tool	What It Detects
RadCheck	Point-in-time reliability risk (0–100 score)
Sentinel	Real-time stalls, silence gaps, runtime anomalies
Watchdog	Missed heartbeats, liveness failures, throughput collapse
DriftGuard	Long-horizon behavioral drift across sessions
FindMyAgent	Fleet presence — who’s up, who’s stalled, who needs attention
SphinxGate	Unauthorized routing, policy violations, audit anomalies
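
To make "missed heartbeats" concrete, here is a minimal sketch of a liveness check. It is a generic illustration and not Watchdog's implementation: the function name, the `timeout_s` parameter, and its 30-second default are all assumptions.

```python
import time
from typing import Optional


def stalled_agents(
    last_heartbeat: dict[str, float],
    timeout_s: float = 30.0,
    now: Optional[float] = None,
) -> list[str]:
    """Return agent IDs whose last heartbeat is older than `timeout_s` seconds.

    `last_heartbeat` maps an agent ID to a monotonic-clock timestamp in
    seconds. `now` can be passed explicitly to make the check testable.
    """
    current = time.monotonic() if now is None else now
    return sorted(
        agent
        for agent, seen in last_heartbeat.items()
        if current - seen > timeout_s
    )
```

A monotonic clock is used so that wall-clock adjustments cannot make a healthy agent look stalled.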

Triage

Tools that help you understand what happened and what to do.

Tool	What It Classifies
OCTriage	Incident type, root cause, evidence bundle, next steps
DriftGuard	Drift patterns and memory integrity issues

Recover

Tools that coordinate and guide recovery.

Tool	What It Provides
Agent911	Unified control plane — telemetry, playbooks, proof bundles
Watchdog	Escalation and handoff to recovery workflow

Verify

Tools that confirm recovery was actually successful.

Tool	What It Verifies
Lazarus	Recovery readiness — can your system actually restore?

Signal Bus

Tool	Role
Transmission	Moves reliability signals between all components with delivery guarantees
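
The table does not say which delivery guarantee Transmission offers. Purely as an illustration of one common guarantee, the sketch below shows at-least-once delivery: the sender retries until the receiver acknowledges. The function name, the `send` callback, and `max_attempts` are hypothetical and do not describe Transmission's API.

```python
from typing import Callable


def deliver_at_least_once(
    signal: dict,
    send: Callable[[dict], bool],
    max_attempts: int = 5,
) -> int:
    """Call `send` until it acknowledges (returns True) or attempts run out.

    Returns the number of attempts used. Raises RuntimeError if the signal
    was never acknowledged, so a lost signal is never silent.
    """
    for attempt in range(1, max_attempts + 1):
        if send(signal):
            return attempt
    raise RuntimeError(f"signal not acknowledged after {max_attempts} attempts")
```

Because a retry can deliver the same signal twice, a receiver under this scheme would need to be idempotent, for example by deduplicating on a signal ID. A production version would also wait between attempts.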

Common Deployment Patterns

Starting Out: Free Baseline

For teams just getting started with agent reliability.

RadCheck → OCTriage → Lazarus

Run RadCheck to understand your current state
Use OCTriage when incidents happen
Run Lazarus to verify you can recover

Cost: Free.

Production Protection: Core Stack

For teams running production agents that need runtime protection.

Sentinel + Watchdog → Agent911 + FindMyAgent
       ↓                      ↓
  Detection layer      Response layer

Sentinel catches runtime anomalies continuously
Watchdog verifies liveness and escalates
Agent911 provides the response surface with guided playbooks
FindMyAgent gives fleet-wide visibility

Best for: Teams with 2+ production agents, recurring incidents, or 24/7 uptime requirements.

Multi-Model Governance: Add SphinxGate

For teams using multiple models or with compliance requirements.

Core Stack + SphinxGate + Transmission

SphinxGate enforces routing policy and maintains an audit trail
Transmission ensures all signals reach Agent911 reliably

Best for: Teams using fallback chains, multiple providers, or regulated AI usage.

Full Stack: Operator Bundle

All tools. Full coverage across detection, triage, routing, recovery, and verification. See the Operator Bundle for details and pricing.

Design Principles

Observe, don’t interfere

ACME tools watch your systems. They do not autonomously modify agent behavior, restart processes, or take recovery actions without operator direction. Every action that changes system state is operator-initiated. This is intentional. Autonomy without understanding creates more incidents, not fewer.
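
One way to picture this principle is a proposed action that cannot run until an operator approves it. The sketch below is an assumption-laden illustration, not ACME code: `ProposedAction` and its methods are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProposedAction:
    """A state-changing step a tool may suggest but never run on its own."""

    description: str
    run: Callable[[], str]
    approved_by: Optional[str] = None

    def approve(self, operator: str) -> None:
        """Record which operator authorized the action."""
        self.approved_by = operator

    def execute(self) -> str:
        # Refuse to change system state until a named operator has approved.
        if self.approved_by is None:
            raise PermissionError("operator approval required")
        return self.run()
```

Calling `execute()` before `approve()` raises `PermissionError`, so the only path to a state change goes through a named operator.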

Evidence-first

Every detection includes evidence. Every incident includes a proof bundle. Every recovery includes a readiness check. The goal is that operators always know why a tool is telling them something, not just what.
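
As a small illustration of the idea, a detection record can be designed so that it cannot exist without evidence. The `Detection` class and its fields below are hypothetical; they are not the schema any ACME tool emits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """A finding that must carry the evidence behind it."""

    tool: str
    summary: str
    evidence: tuple[str, ...]

    def __post_init__(self) -> None:
        # Reject findings that say what happened without showing why.
        if not self.evidence:
            raise ValueError("a detection must include at least one piece of evidence")
```

Constructing a `Detection` with an empty `evidence` tuple raises `ValueError`, which pushes the "why" requirement into the data model rather than leaving it to convention.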

Deterministic playbooks

Recovery shouldn’t depend on which team member is on-call. Agent911 playbooks give the same guidance to everyone, regardless of familiarity with the system. Same incident type → same playbook → same recovery path.
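
The "same incident type → same playbook" property amounts to a pure lookup. The sketch below shows the shape of that idea; the incident types, the step text, and the `playbook_for` function are made-up placeholders, not Agent911 playbooks.

```python
# Incident types and steps are illustrative placeholders.
PLAYBOOKS: dict[str, tuple[str, ...]] = {
    "stall": ("capture evidence", "confirm liveness", "restart with operator approval"),
    "drift": ("capture evidence", "compare against baseline", "review memory integrity"),
}


def playbook_for(incident_type: str) -> tuple[str, ...]:
    """Return the fixed steps for an incident type.

    The lookup takes no caller identity, clock, or randomness as input,
    so the same incident type always yields the same steps.
    """
    try:
        return PLAYBOOKS[incident_type]
    except KeyError:
        raise LookupError(f"no playbook for incident type {incident_type!r}") from None
```

Steps are stored as tuples so a caller cannot accidentally edit a playbook in place.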

Trust-first free tier

RadCheck, OCTriage, and Lazarus are free. These tools build the foundation of trust — you understand your system before you buy anything. Paid tools solve the problem at scale.

What ACME Is Not

Understanding what we don’t do is as important as understanding what we do.

ACME is not:

An orchestration framework (we watch your agents; we don’t direct them)
An autonomous healing system (operators always act)
A replacement for logging/APM (we’re the reliability layer on top)
Tied to one agent framework (we work with OpenClaw, LangChain, AutoGPT, and custom systems)

Getting Started

5-Minute Quickstart

Get RadCheck running and get your first reliability score.

All Products

Browse every tool with descriptions and pricing.

Pricing

See free vs. paid options and bundles.

Support

Get help. We respond fast.
