Snapshot Explained

A snapshot is Agent911’s core output: a unified, point-in-time picture of your agent system’s reliability state. It’s designed to answer one question:
“What is happening right now, and what should I do next?”

Anatomy of a Snapshot

A snapshot has five sections:
Agent911 Snapshot
=================
Generated: 2026-03-15 15:44:02 UTC
Agent(s):  support-agent-prod, pipeline-agent-01

① Health Summary       ② Anomaly Correlation
③ Governance Status    ④ Recovery Readiness
⑤ Recommended Playbook

① Health Summary

The current liveness and behavioral state of each monitored agent:
Health Summary
──────────────
support-agent-prod      DEGRADED   (stall detected, 4m ago)
pipeline-agent-01       HEALTHY    (last heartbeat: 12s ago)
radcheck-worker         HEALTHY    (last scan: 6 min ago)
Each agent is reported in one of four states:
  • HEALTHY — active, progressing, heartbeat current
  • DEGRADED — running but showing risk signals
  • STALLED — no progress detected within threshold
  • OFFLINE — no signal received
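The four states can be read as an ordered set of checks. The sketch below is only an illustration of that reading, not Agent911's actual logic: the function name, its inputs, and both thresholds (`offline_after_s`, `stall_after_s`) are assumptions made for the example.

```python
from enum import Enum


class AgentState(Enum):
    HEALTHY = "HEALTHY"
    DEGRADED = "DEGRADED"
    STALLED = "STALLED"
    OFFLINE = "OFFLINE"


def classify(heartbeat_age_s, progress_age_s, risk_signals,
             offline_after_s=300.0, stall_after_s=120.0):
    """Map raw liveness inputs to one of the four documented states.

    heartbeat_age_s: seconds since the last heartbeat, or None if no
        signal was ever received.
    progress_age_s: seconds since the agent last made progress.
    risk_signals: number of open risk signals for the agent.
    Both thresholds are hypothetical defaults chosen for illustration.
    """
    # OFFLINE: no signal received.
    if heartbeat_age_s is None or heartbeat_age_s > offline_after_s:
        return AgentState.OFFLINE
    # STALLED: no progress detected within threshold.
    if progress_age_s > stall_after_s:
        return AgentState.STALLED
    # DEGRADED: running but showing risk signals.
    if risk_signals > 0:
        return AgentState.DEGRADED
    # HEALTHY: active, progressing, heartbeat current.
    return AgentState.HEALTHY
```

Checking the most severe condition first means an agent that is both silent and flagged is reported as OFFLINE rather than DEGRADED.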

② Anomaly Correlation

Agent911 doesn’t just list raw alerts — it groups them into correlated events:
Anomalies
─────────
[CORRELATED EVENT — HIGH CONFIDENCE]
  15:41:02  Sentinel: silence gap detected (support-agent-prod)
  15:41:15  Watchdog: missed heartbeat (support-agent-prod)
  15:42:31  Sentinel: stall confirmed (support-agent-prod)
  Correlation: STALL — single root cause suspected

[INFORMATIONAL]
  13:30:00  DriftGuard: minor behavioral delta from 24h baseline
            (within normal variance — no action required)
Correlated events reduce noise. Three separate alerts about the same underlying issue appear as one event, not three things to investigate separately.
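One simple way to picture this grouping is to chain alerts for the same agent that arrive close together in time. The sketch below assumes that approach; the `Alert` type, the `correlate` function, and the five-minute window are hypothetical and are not taken from Agent911's implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    time: datetime
    source: str
    agent: str
    message: str


def correlate(alerts, window=timedelta(minutes=5)):
    """Group alerts for the same agent when each follows the previous
    one in its group by at most `window` (an assumed value)."""
    groups = []
    latest_group = {}  # agent name -> most recent group for that agent
    for alert in sorted(alerts, key=lambda a: a.time):
        group = latest_group.get(alert.agent)
        if group is not None and alert.time - group[-1].time <= window:
            group.append(alert)
        else:
            group = [alert]
            groups.append(group)
            latest_group[alert.agent] = group
    return groups


# The three alerts from the sample output above.
alerts = [
    Alert(datetime(2026, 3, 15, 15, 41, 2), "Sentinel",
          "support-agent-prod", "silence gap detected"),
    Alert(datetime(2026, 3, 15, 15, 41, 15), "Watchdog",
          "support-agent-prod", "missed heartbeat"),
    Alert(datetime(2026, 3, 15, 15, 42, 31), "Sentinel",
          "support-agent-prod", "stall confirmed"),
]
groups = correlate(alerts)
print(len(groups), len(groups[0]))  # 1 3
```

A real correlator would likely also weigh alert types and confidence; this version only shows why three alerts can collapse into one event.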

③ Governance Status

If SphinxGate is configured, this section shows current routing policy state:
Governance (SphinxGate)
───────────────────────
Active policy:    production-v2
Provider usage:   openai/gpt-4o (primary) — ALLOWED
                  anthropic/claude-sonnet (fallback) — ALLOWED
Last routing:     15:44:01 UTC — openai/gpt-4o
Audit log:        /var/log/acme/sphinxgate/routing-2026-03-15.log
If SphinxGate is not configured, this section shows NOT CONFIGURED.

④ Recovery Readiness

If Lazarus is configured, this section shows your current backup posture:
Recovery Readiness (Lazarus)
────────────────────────────
Last readiness check:    2026-03-15 06:00:00 UTC
Overall readiness:       READY (4/4 surfaces verified)

Surfaces:
  Agent config files     ✓ BACKED UP   (2h ago)
  Session state          ✓ BACKED UP   (2h ago)
  Tool configurations    ✓ BACKED UP   (2h ago)
  Provider credentials   ✓ BACKED UP   (2h ago)
If Lazarus reports a surface as NOT VERIFIED, resolve it before attempting recovery. Unverified backups may not restore cleanly.
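A pre-recovery gate along these lines can be scripted. The sketch below assumes the surface statuses have already been collected into a plain dictionary; that data shape and the `unverified_surfaces` helper are hypothetical, and only the status labels come from the sample output.

```python
VERIFIED = "BACKED UP"  # status label used in the sample output


def unverified_surfaces(surfaces):
    """Return the names of surfaces that are not verified backups."""
    return [name for name, status in surfaces.items() if status != VERIFIED]


# Hypothetical readiness data with one unverified surface.
surfaces = {
    "Agent config files": "BACKED UP",
    "Session state": "BACKED UP",
    "Tool configurations": "NOT VERIFIED",
    "Provider credentials": "BACKED UP",
}

blocked = unverified_surfaces(surfaces)
if blocked:
    print("Resolve before recovery:", ", ".join(blocked))
```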

⑤ Recommended Playbook

Based on the anomaly correlation, Agent911 recommends the appropriate recovery playbook:
Recommended Playbook: STALL_RECOVERY_v2
────────────────────────────────────────
Confidence: 91% (stall pattern, single agent)

Step 1: Confirm Sentinel alert context (see: Anomalies section above)
Step 2: Check external API status for affected agent
Step 3: Run `acme radcheck --agent support-agent-prod`
Step 4: Verify Lazarus readiness (status: READY ✓)
Step 5: Restart agent: `acme agent restart support-agent-prod`
Step 6: Confirm Watchdog liveness within 60s post-restart
Step 7: Monitor Sentinel for recurrence over next 15 minutes

Snapshot Freshness

Snapshots reflect system state at the moment they’re generated. Signals older than 5 minutes are marked [STALE]. For incidents in progress, regenerate frequently:
# Refresh snapshot
acme agent911 snapshot

# Auto-refresh every 30 seconds
acme agent911 snapshot --watch --interval 30
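The staleness rule is a plain age comparison against the snapshot's generation time. The sketch below illustrates it with the five-minute threshold stated above; the `render_signal` helper, the placement of the `[STALE]` tag, and the example timestamp are assumptions.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=5)  # threshold stated in the text


def render_signal(text, observed_at, generated_at):
    """Append [STALE] when the signal is older than the threshold
    relative to the snapshot's generation time."""
    if generated_at - observed_at > STALE_AFTER:
        return f"{text} [STALE]"
    return text


# Arbitrary example generation time, chosen only for illustration.
generated = datetime(2026, 3, 15, 12, 0, 0, tzinfo=timezone.utc)
fresh = render_signal("last heartbeat",
                      generated - timedelta(seconds=12), generated)
stale = render_signal("last scan",
                      generated - timedelta(minutes=6), generated)
print(fresh)  # last heartbeat
print(stale)  # last scan [STALE]
```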

Exporting Snapshots

Snapshots can be exported as proof bundles for compliance, post-incident review, or support escalation:
# Export current snapshot as proof bundle
acme agent911 bundle --output incident-$(date +%Y%m%d-%H%M%S).tar.gz

# Export with full log context
acme agent911 bundle --verbose --output full-incident.tar.gz
A proof bundle includes:
  • The snapshot JSON
  • All referenced log excerpts
  • Correlation analysis output
  • Lazarus readiness report (if configured)
  • Governance audit entries (if SphinxGate is configured)

Snapshot via CLI

# Human-readable snapshot
acme agent911 snapshot

# JSON output (for scripting or CI)
acme agent911 snapshot --format json

# Specific agent only
acme agent911 snapshot --agent <name>

# Save to file
acme agent911 snapshot --output snapshot.json
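For scripting or CI, the JSON output can be parsed and checked for unhealthy agents. The schema is not documented in this section, so the sketch below assumes a hypothetical shape with an `agents` list carrying `name` and `state` fields; adjust the keys to match the real output.

```python
import json

# Hypothetical snapshot shape; the real `--format json` schema may differ.
SAMPLE = """
{
  "agents": [
    {"name": "support-agent-prod", "state": "DEGRADED"},
    {"name": "pipeline-agent-01", "state": "HEALTHY"}
  ]
}
"""


def unhealthy_agents(snapshot_text):
    """Return the names of agents whose state is anything but HEALTHY."""
    snapshot = json.loads(snapshot_text)
    return [
        agent["name"]
        for agent in snapshot.get("agents", [])
        if agent.get("state") != "HEALTHY"
    ]


print(unhealthy_agents(SAMPLE))  # ['support-agent-prod']
```

In a pipeline, the same function could read the text produced by `acme agent911 snapshot --format json` and fail the job when the returned list is non-empty.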

Next Steps