Snapshot Explained

A snapshot is Agent911’s core output: a unified, point-in-time picture of your agent system’s reliability state. It’s designed to answer one question:

“What is happening right now, and what should I do next?”

Anatomy of a Snapshot

A snapshot has five sections:

Agent911 Snapshot
=================
Generated: 2026-03-15 15:44:02 UTC
Agent(s):  support-agent-prod, pipeline-agent-01

① Health Summary       ② Anomaly Correlation
③ Governance Status    ④ Recovery Readiness
⑤ Recommended Playbook

① Health Summary

The current liveness and behavioral state of each monitored agent:

Health Summary
──────────────
support-agent-prod      DEGRADED   (stall detected, 4m ago)
pipeline-agent-01       HEALTHY    (last heartbeat: 12s ago)
radcheck-worker         HEALTHY    (last scan: 6 min ago)

Each agent is reported in one of four states:

HEALTHY — active, progressing, heartbeat current
DEGRADED — running but showing risk signals
STALLED — no progress detected within threshold
OFFLINE — no signal received
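As a rough illustration of how the four states relate to each other, the sketch below maps signal ages to a state. The function name, the threshold value, and the order of the checks are all assumptions made for this example; the page does not document how Agent911 actually decides a state.

```python
from datetime import timedelta

# Hypothetical threshold; Agent911's real value is not documented here.
PROGRESS_TIMEOUT = timedelta(minutes=3)


def classify(heartbeat_age, progress_age, risk_signals=0):
    """Map signal ages to one of the four states listed above.

    heartbeat_age / progress_age: timedelta since the last heartbeat or
    last observed progress, or None if that signal was never received.
    risk_signals: count of open risk signals for the agent.
    """
    if heartbeat_age is None:
        return "OFFLINE"    # no signal received
    if progress_age is None or progress_age > PROGRESS_TIMEOUT:
        return "STALLED"    # no progress within the threshold
    if risk_signals > 0:
        return "DEGRADED"   # running, but showing risk signals
    return "HEALTHY"        # active, progressing, heartbeat current
```

For example, `classify(timedelta(seconds=12), timedelta(seconds=5))` returns `"HEALTHY"` under these assumed rules.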

② Anomaly Correlation

Agent911 doesn’t just list raw alerts — it groups them into correlated events:

Anomalies
─────────
[CORRELATED EVENT — HIGH CONFIDENCE]
  15:41:02  Sentinel: silence gap detected (support-agent-prod)
  15:41:15  Watchdog: missed heartbeat (support-agent-prod)
  15:42:31  Sentinel: stall confirmed (support-agent-prod)
  Correlation: STALL — single root cause suspected

[INFORMATIONAL]
  13:30:00  DriftGuard: minor behavioral delta from 24h baseline
            (within normal variance — no action required)

Correlated events reduce noise: three alerts that share one underlying cause appear as a single event rather than three separate investigations.
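One simple way to picture this grouping is a time-window rule: alerts for the same agent that arrive close together are folded into one event. The sketch below is only an illustration of that idea. The `correlate` function, the tuple layout, and the 5-minute window are assumptions, not Agent911's actual correlation logic.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # hypothetical correlation window


def correlate(alerts):
    """Group (timestamp, source, agent, message) alerts into events.

    An alert joins the latest event for its agent when it arrives within
    WINDOW of that event's most recent alert; otherwise it starts a new event.
    """
    events = []
    latest = {}  # agent name -> most recent event (a list of alerts)
    for alert in sorted(alerts, key=lambda a: a[0]):
        ts, _source, agent, _message = alert
        current = latest.get(agent)
        if current is not None and ts - current[-1][0] <= WINDOW:
            current.append(alert)
        else:
            current = [alert]
            latest[agent] = current
            events.append(current)
    return events


alerts = [
    (datetime(2026, 3, 15, 15, 41, 2), "Sentinel", "support-agent-prod", "silence gap detected"),
    (datetime(2026, 3, 15, 15, 41, 15), "Watchdog", "support-agent-prod", "missed heartbeat"),
    (datetime(2026, 3, 15, 15, 42, 31), "Sentinel", "support-agent-prod", "stall confirmed"),
]
events = correlate(alerts)
print(len(events))  # 1: all three alerts fall into a single event
```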

③ Governance Status

If SphinxGate is configured, this section shows current routing policy state:

Governance (SphinxGate)
───────────────────────
Active policy:    production-v2
Provider usage:   openai/gpt-4o (primary) — ALLOWED
                  anthropic/claude-sonnet (fallback) — ALLOWED
Last routing:     15:44:01 UTC — openai/gpt-4o
Audit log:        /var/log/acme/sphinxgate/routing-2026-03-15.log

If SphinxGate is not configured, this section shows NOT CONFIGURED.

④ Recovery Readiness

If Lazarus is configured, this section shows your current backup posture:

Recovery Readiness (Lazarus)
────────────────────────────
Last readiness check:    2026-03-15 06:00:00 UTC
Overall readiness:       READY (4/4 surfaces verified)

Surfaces:
  Agent config files     ✓ BACKED UP   (2h ago)
  Session state          ✓ BACKED UP   (2h ago)
  Tool configurations    ✓ BACKED UP   (2h ago)
  Provider credentials   ✓ BACKED UP   (2h ago)

If Lazarus reports a surface as NOT VERIFIED, resolve it before attempting recovery. Unverified backups may not restore cleanly.
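The rule above can be expressed as a small pre-recovery gate. This is a sketch under assumptions: the dictionary shape and the function names are invented for illustration, and only the `BACKED UP` and `NOT VERIFIED` status strings come from this page.

```python
def unverified_surfaces(surfaces):
    """Return the names of surfaces that Lazarus reports as NOT VERIFIED.

    surfaces: hypothetical mapping of surface name -> status string.
    """
    return [name for name, status in surfaces.items() if status == "NOT VERIFIED"]


def safe_to_recover(surfaces):
    """Recovery should proceed only when no surface is NOT VERIFIED."""
    return not unverified_surfaces(surfaces)


surfaces = {
    "Agent config files": "BACKED UP",
    "Session state": "BACKED UP",
    "Tool configurations": "NOT VERIFIED",
    "Provider credentials": "BACKED UP",
}
print(unverified_surfaces(surfaces))  # ['Tool configurations']
print(safe_to_recover(surfaces))      # False
```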

⑤ Recommended Playbook

Based on the anomaly correlation, Agent911 recommends the appropriate recovery playbook:

Recommended Playbook: STALL_RECOVERY_v2
────────────────────────────────────────
Confidence: 91% (stall pattern, single agent)

Step 1: Confirm Sentinel alert context (see: Anomalies section above)
Step 2: Check external API status for affected agent
Step 3: Run `acme radcheck --agent support-agent-prod`
Step 4: Verify Lazarus readiness (status: READY ✓)
Step 5: Restart agent: `acme agent restart support-agent-prod`
Step 6: Confirm Watchdog liveness within 60s post-restart
Step 7: Monitor Sentinel for recurrence over next 15 minutes

Snapshot Freshness

Snapshots reflect system state at the moment they’re generated. Signals older than 5 minutes are marked [STALE]. For incidents in progress, regenerate frequently:

# Refresh snapshot
acme agent911 snapshot

# Auto-refresh every 30 seconds
acme agent911 snapshot --watch --interval 30
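The staleness rule can be sketched as a comparison between a signal's timestamp and the snapshot's generation time. The 5-minute cutoff comes from the text above. The function name and the timestamps are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=5)  # cutoff stated in the text above


def freshness_marker(signal_time, generated_at):
    """Return '[STALE]' for signals older than 5 minutes at generation time."""
    age = generated_at - signal_time
    return "[STALE]" if age > STALE_AFTER else ""


# Illustrative timestamps only, not taken from a real snapshot.
generated_at = datetime(2026, 3, 15, 12, 0, 0, tzinfo=timezone.utc)
recent = datetime(2026, 3, 15, 11, 58, 0, tzinfo=timezone.utc)
old = datetime(2026, 3, 15, 11, 30, 0, tzinfo=timezone.utc)

print(repr(freshness_marker(recent, generated_at)))  # ''
print(repr(freshness_marker(old, generated_at)))     # '[STALE]'
```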

Exporting Snapshots

Snapshots can be exported as proof bundles for compliance, post-incident review, or support escalation:

# Export current snapshot as proof bundle
acme agent911 bundle --output incident-$(date +%Y%m%d-%H%M%S).tar.gz

# Export with full log context
acme agent911 bundle --verbose --output full-incident.tar.gz

A proof bundle includes:

The snapshot JSON
All referenced log excerpts
Correlation analysis output
Lazarus readiness report (if configured)
Governance audit entries (if SphinxGate is configured)

Snapshot via CLI

# Human-readable snapshot
acme agent911 snapshot

# JSON output (for scripting or CI)
acme agent911 snapshot --format json

# Specific agent only
acme agent911 snapshot --agent <name>

# Save to file
acme agent911 snapshot --output snapshot.json
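For scripting, the JSON form can be read from another program. The sketch below wraps the documented `acme agent911 snapshot --format json` command and assumes it writes a single JSON document to standard output. That behavior and the `load_snapshot` helper are assumptions. The JSON schema is not described on this page, so the sketch returns the parsed data without interpreting it.

```python
import json
import subprocess

# The command documented above; the output shape is not specified on this page.
SNAPSHOT_COMMAND = ("acme", "agent911", "snapshot", "--format", "json")


def load_snapshot(command=SNAPSHOT_COMMAND):
    """Run the snapshot command and parse its stdout as JSON.

    Raises subprocess.CalledProcessError if the command exits non-zero and
    json.JSONDecodeError if stdout is not valid JSON.
    """
    result = subprocess.run(list(command), capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


# Usage (requires the acme CLI on PATH):
#   snapshot = load_snapshot()
```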

Next Steps

Agent911 Overview

Lazarus
