Agent911
Agent911 is ACME’s unified control plane for agent reliability. When something goes wrong, or before it does, Agent911 gives you a single surface to understand system state, trace the cause, and follow a deterministic path to resolution.
The Problem Agent911 Solves
As your agent fleet grows, incidents get harder to diagnose. You have logs from five places, heartbeats from three tools, and no unified picture of what’s actually happening. When something breaks at 2am:
- Which agent is affected?
- What signals are available right now?
- What should I do first?
What Agent911 Does
Unified Telemetry Snapshot
Aggregates signals from 7 sources into a single view of system state.
Guided Recovery Playbooks
Deterministic, step-by-step operator guidance for each incident type.
Proof Bundles
Exportable evidence artifacts for post-incident review and compliance.
Incident Timeline
Reconstructed timeline of what happened, when, and in what order.
How It Works
Agent911 aggregates reliability signals from across the ACME stack, and from your existing infrastructure, into a coherent operator view.
What Agent911 Does NOT Do
This list is intentional; trust requires precision. Agent911 does not:
- Automatically restart or repair agents
- Modify configurations without explicit operator action
- Make autonomous decisions about recovery
- Replace the need for an operator on-call
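
The boundary described above (guidance only, no autonomous action) can be illustrated with a small sketch. This is not Agent911's implementation or API: `Step`, `Playbook`, `next_step`, `acknowledge`, and the `stalled_agent` incident type are all hypothetical names chosen for the example. The point it shows is that the playbook only hands out instructions in a fixed order and records who acknowledged each one; it never runs anything itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Step:
    number: int
    instruction: str  # text shown to the operator; never executed by the tool


@dataclass
class Playbook:
    """Hypothetical guidance-only playbook: ordered steps plus an audit trail."""
    incident_type: str
    steps: tuple
    acknowledgements: list = field(default_factory=list)  # (step number, operator)

    def next_step(self):
        """Return the first step not yet acknowledged, or None when finished."""
        done = {number for number, _ in self.acknowledgements}
        for step in self.steps:
            if step.number not in done:
                return step
        return None

    def acknowledge(self, number, operator):
        """Record that an operator performed a step; steps must go in order."""
        expected = self.next_step()
        if expected is None or expected.number != number:
            raise ValueError("steps must be acknowledged in order")
        self.acknowledgements.append((number, operator))


# Example playbook content is invented for illustration.
playbook = Playbook(
    incident_type="stalled_agent",
    steps=(
        Step(1, "Confirm the stall in the snapshot before acting."),
        Step(2, "Restart the agent manually if the stall persists."),
    ),
)
```

Because the only state change is an acknowledgement made by a named operator, the same incident type always yields the same sequence of instructions, which is one way to read "deterministic" in this context.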
Telemetry Sources
Agent911 v0.1 aggregates from 7 signal sources:
| Source | Signal Type |
|---|---|
| Sentinel | Runtime anomalies, stall detection |
| Watchdog | Heartbeat / liveness probe results |
| OCTriage | Incident classification, evidence bundles |
| DriftGuard | Session-to-session behavioral delta |
| RadCheck | Baseline reliability score history |
| Lazarus | Recovery readiness posture |
| Native logs | Raw process/gateway logs (when available) |
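
To make the table concrete, here is a minimal sketch of how signals from these seven sources could be modeled and merged into one point-in-time view. The source names come from the table; everything else (`Source`, `Signal`, `Snapshot`, `build_snapshot`, the field names, and the signal kinds) is an assumption made for illustration and does not describe Agent911's actual data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Source(Enum):
    """The seven signal sources listed in the table above."""
    SENTINEL = "Sentinel"
    WATCHDOG = "Watchdog"
    OCTRIAGE = "OCTriage"
    DRIFTGUARD = "DriftGuard"
    RADCHECK = "RadCheck"
    LAZARUS = "Lazarus"
    NATIVE_LOGS = "Native logs"


@dataclass(frozen=True)
class Signal:
    # Hypothetical shape of a single signal; field names are assumptions.
    source: Source
    agent_id: str
    kind: str
    observed_at: datetime


@dataclass
class Snapshot:
    taken_at: datetime
    signals: list = field(default_factory=list)

    def missing_sources(self):
        """Sources that contributed nothing (e.g. native logs not available)."""
        return set(Source) - {s.source for s in self.signals}


def build_snapshot(signals, taken_at):
    """Merge signals from any sources into one view, ordered by time."""
    ordered = sorted(signals, key=lambda s: s.observed_at)
    return Snapshot(taken_at=taken_at, signals=ordered)


now = datetime(2024, 1, 1, 2, 0, 0, tzinfo=timezone.utc)
earlier = datetime(2024, 1, 1, 1, 56, 0, tzinfo=timezone.utc)
snapshot = build_snapshot(
    [
        Signal(Source.WATCHDOG, "agent-1", "missed_heartbeat", now),
        Signal(Source.SENTINEL, "agent-1", "stall", earlier),
    ],
    taken_at=now,
)
```

Tracking which sources are absent matters because some of them are optional; a view that hides a silent source can look healthier than the system really is.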
Snapshot: What You See
When you open Agent911, you see a snapshot, a coherent view of system state at a point in time:
- Current health signals across all agents
- Recent anomalies and their correlations
- Routing/governance status (if SphinxGate is configured)
- Recovery readiness posture (if Lazarus is configured)
What does 'coherent view' actually mean?
A snapshot doesn’t just show raw signals — it contextualizes them. If Sentinel detected a stall 4 minutes ago and Watchdog shows a missed heartbeat at the same time, Agent911 surfaces these as a correlated event, not two separate alerts.
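
The correlation described above can be sketched as a simple time-window grouping. This is an illustrative assumption, not Agent911's actual correlation logic: the `Alert` type, the `correlate` function, and the 60-second default window are all hypothetical. The sketch groups alerts for the same agent when each one arrives within the window of the previous alert in its group.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape; field names are assumptions.
    source: str
    agent_id: str
    kind: str
    at: datetime


def correlate(alerts, window=timedelta(seconds=60)):
    """Group alerts per agent; an alert joins the agent's latest group when
    it falls within `window` of that group's most recent alert."""
    groups = []
    latest_by_agent = {}
    for alert in sorted(alerts, key=lambda a: a.at):
        current = latest_by_agent.get(alert.agent_id)
        if current is not None and alert.at - current[-1].at <= window:
            current.append(alert)
        else:
            current = [alert]
            latest_by_agent[alert.agent_id] = current
            groups.append(current)
    return groups


t0 = datetime(2024, 1, 1, 2, 0, 0)
events = correlate([
    Alert("Sentinel", "agent-1", "stall", t0),
    Alert("Watchdog", "agent-1", "missed_heartbeat", t0 + timedelta(seconds=10)),
    Alert("Watchdog", "agent-2", "missed_heartbeat", t0 + timedelta(seconds=5)),
])
```

Here the Sentinel stall and the Watchdog missed heartbeat on `agent-1` end up in one group, while the unrelated alert on `agent-2` stays separate. A fixed window is the simplest possible rule; a real system would likely weigh signal type and source as well.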