Watchdog

Watchdog continuously verifies that your agents are not just alive — but actually making progress. It monitors gateway and agent liveness, probes for stall signatures, and escalates before silent throughput collapse becomes an outage.

The Gap Between “Running” and “Working”

Process monitors and uptime checkers answer one question: is the process alive? That’s necessary but insufficient for agents. An agent can be:
  • Running but stalled — process is up, no work happening
  • Running but looping — executing continuously, not advancing
  • Running but silent — heartbeat looks fine, outputs have stopped
Watchdog detects these conditions: cases where the process is alive but the agent is not actually working.

What Watchdog Does

Gateway Probes

Actively probes gateway health at a regular interval (30 seconds by default), so health is verified rather than assumed.

Stall Signature Detection

Recognizes patterns consistent with execution stalls before they cause failures.

Throughput Monitoring

Tracks whether work is actually completing, not just whether the agent is active.

Escalation Pipeline

Routes detections to Sentinel, Agent911, and configured alert channels.

How Watchdog Works

Watchdog runs as a persistent background monitor with two probe types:

Liveness Probes

Active checks that the gateway and agent runtime are responsive:
Every 30s:
  → Probe gateway endpoint
  → Verify response within SLA window (default: 5s)
  → Record heartbeat age
  → Escalate after N consecutive missed probes (default: 3)
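
The steps above can be sketched as a small consecutive-miss counter. This is only an illustration, not Watchdog's implementation: the `LivenessTracker` class and its `record` method are hypothetical names, and treating a response slower than the SLA window as a miss is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LivenessTracker:
    """Counts consecutive missed probes for one target (hypothetical helper)."""
    timeout_seconds: float = 5.0   # SLA window from the steps above
    missed_threshold: int = 3      # the "N" in "N consecutive probes"
    consecutive_misses: int = 0

    def record(self, response_seconds: Optional[float]) -> bool:
        """Record one probe result; return True when escalation is due.

        response_seconds is None when the probe got no response at all.
        Assumption: a response slower than the SLA window also counts as a miss.
        """
        missed = response_seconds is None or response_seconds > self.timeout_seconds
        # A single healthy probe resets the streak.
        self.consecutive_misses = self.consecutive_misses + 1 if missed else 0
        return self.consecutive_misses >= self.missed_threshold


tracker = LivenessTracker()
results = [tracker.record(r) for r in [0.2, None, 6.1, None]]
print(results)  # [False, False, False, True]
```

Resetting the counter on any healthy probe is what makes the threshold "consecutive": one slow probe in isolation never escalates.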

Progress Probes

Passive observation that meaningful work is completing:
Continuously:
  → Monitor context update frequency
  → Track tool call completion rate
  → Watch for repeated-output patterns (loop detection)
  → Measure throughput vs. expected baseline
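
A rough sketch of two of these passive checks, loop detection and stall detection, might look like the following. The `ProgressTracker` class is hypothetical, and treating "a changed output" as the definition of progress is an assumption made for illustration; Watchdog's actual signals (context updates, tool call completions) are richer than this.

```python
from collections import deque


class ProgressTracker:
    """Passive progress checks for one agent (hypothetical helper)."""

    def __init__(self, stall_threshold_seconds=300.0, loop_threshold=5):
        self.stall_threshold_seconds = stall_threshold_seconds
        self.loop_threshold = loop_threshold
        self.last_progress_at = None
        # Keep only the most recent `loop_threshold` outputs.
        self.recent_outputs = deque(maxlen=loop_threshold)

    def observe_output(self, output: str, now: float) -> None:
        """Record an agent output seen at time `now` (seconds)."""
        # Assumption: only a changed output counts as progress; a repeat does not.
        if not self.recent_outputs or self.recent_outputs[-1] != output:
            self.last_progress_at = now
        self.recent_outputs.append(output)

    def is_looping(self) -> bool:
        """True when the last `loop_threshold` outputs are all identical."""
        full = len(self.recent_outputs) == self.loop_threshold
        return full and len(set(self.recent_outputs)) == 1

    def is_stalled(self, now: float) -> bool:
        """True when no progress has been seen for the stall threshold."""
        if self.last_progress_at is None:
            return False  # nothing observed yet
        return now - self.last_progress_at >= self.stall_threshold_seconds


tracker = ProgressTracker(stall_threshold_seconds=300, loop_threshold=3)
for t, out in enumerate(["step 1", "retrying", "retrying", "retrying"]):
    tracker.observe_output(out, now=float(t))
print(tracker.is_looping())           # True
print(tracker.is_stalled(now=100.0))  # False
print(tracker.is_stalled(now=301.0))  # True
```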

Detection Scope

Detection                    Trigger
Missed heartbeat             No liveness response for N consecutive intervals
Stall                        No meaningful progress for threshold window
Gateway unresponsive         Probe timeout with no response
Silent throughput collapse   Outputs dropped to zero without process failure
Loop detected                Repeated identical outputs N times in succession
Abnormal latency             Response times exceed established baseline by threshold
Probe SLA breach             Gateway responded but took longer than configured SLA
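
The three probe-related detections differ only in how a single probe turned out, which a short sketch can make concrete. The function name `classify_probe`, the `baseline_seconds` value, and the `latency_factor` multiplier are all assumptions for illustration; the table above does not say how Watchdog establishes its baseline or threshold.

```python
from typing import Optional


def classify_probe(response_seconds: Optional[float],
                   sla_seconds: float = 5.0,
                   baseline_seconds: float = 0.5,
                   latency_factor: float = 3.0) -> str:
    """Map one probe result to a detection name from the table (hypothetical)."""
    if response_seconds is None:
        # The probe timed out and nothing came back.
        return "Gateway unresponsive"
    if response_seconds > sla_seconds:
        # A response arrived, but later than the configured SLA.
        return "Probe SLA breach"
    if response_seconds > baseline_seconds * latency_factor:
        # Within SLA, but well above what is normal for this gateway.
        return "Abnormal latency"
    return "OK"


for sample in (None, 6.0, 2.0, 0.3):
    print(sample, "->", classify_probe(sample))
```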

Quick Start

# Start Watchdog
acme watchdog start

# Check status
acme watchdog status

# View recent detections
acme watchdog log --tail 20

Sample Output

Watchdog Monitor
================
Status: ACTIVE
Agents watched: 3

Agent Health:
  ✓ pipeline-agent       ALIVE    Heartbeat: 8s ago
  ✓ indexer-agent        ALIVE    Heartbeat: 12s ago
  ⚠ support-agent-prod   STALL    No progress: 4m 22s

Recent Detections:
  [HIGH] 15:42:09 — support-agent-prod: Stall detected (no progress 4m 22s)
  [HIGH] 15:42:09 — support-agent-prod: Feeding Sentinel + Agent911

Escalation:
  → Sentinel: notified
  → Agent911: snapshot triggered
  → Webhook: alert sent

Configuration

# ~/.acme/watchdog.yaml
probes:
  liveness:
    interval_seconds: 30
    timeout_seconds: 5
    missed_threshold: 3    # Escalate after N missed probes

  progress:
    stall_threshold_seconds: 300   # Escalate if no progress for this long
    loop_threshold: 5              # Escalate after N repeated outputs
    silence_threshold_seconds: 120 # Escalate if silent for this long

alerts:
  channels:
    - type: webhook
      url: https://your-alerting-system.com/webhook
    - type: email
      to: oncall@yourcompany.com

# Integration with ACME stack
integrations:
  sentinel: true
  agent911: true
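
To get a feel for how the liveness settings interact, the arithmetic below estimates how long a dead gateway can go unnoticed. It is a back-of-the-envelope sketch that assumes probes are sent on a fixed schedule and that escalation fires as soon as the Nth consecutive probe times out; the function name is hypothetical.

```python
def liveness_detection_window(interval_seconds, timeout_seconds, missed_threshold):
    """Return (best, worst) seconds from gateway failure to escalation."""
    # Time from sending the first failing probe to the Nth one timing out.
    from_first_probe = (missed_threshold - 1) * interval_seconds + timeout_seconds
    # The failure may occur just before a probe (best case)
    # or just after a successful one (worst case).
    return from_first_probe, from_first_probe + interval_seconds


# Defaults from the configuration above.
print(liveness_detection_window(30, 5, 3))  # (65, 95)
# With probes.liveness.interval_seconds lowered to 15.
print(liveness_detection_window(15, 5, 3))  # (35, 50)
```

Under these assumptions, halving the probe interval roughly halves the detection window, at the cost of twice as many probes against the gateway.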

CLI Reference

# Start/stop/restart
acme watchdog start
acme watchdog stop
acme watchdog restart

# Status and health
acme watchdog status
acme watchdog status --format json
acme watchdog health --agent <name>

# Logs and detections
acme watchdog log
acme watchdog log --tail 50 --follow
acme watchdog log --since 1h --format json

# Probe management
acme watchdog probe run --agent <name>    # Manual probe
acme watchdog probe history               # Probe history

# Configuration
acme watchdog config show
acme watchdog config set probes.liveness.interval_seconds 15

Watchdog vs Sentinel

Watchdog and Sentinel are complementary:
Tool       What It Watches                                   How
Watchdog   Gateway liveness, heartbeat, throughput           Active probes + passive monitoring
Sentinel   Runtime anomalies, stalls, behavioral patterns    Continuous passive observation
Watchdog is the active verifier. Sentinel is the passive observer. Together they cover the full failure surface. When both are configured, they share detections with Agent911 — correlated events surface as higher-confidence incidents.

Escalation Path

When Watchdog detects a problem:
Watchdog detects stall
        │
        ├──▶ Sentinel notified (if configured)
        ├──▶ Agent911 snapshot triggered (if configured)
        ├──▶ Webhook alert sent (if configured)
        └──▶ Watchdog log entry (always)
Operators respond through Agent911’s guided playbook — Watchdog surfaces the problem and hands it off.
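
The fan-out in the diagram can be sketched as a loop over whichever sinks are configured, followed by a log entry that is written unconditionally. The `escalate` function and the sink callables are hypothetical, and isolating a failing sink so the others still receive the detection is a design assumption, not documented Watchdog behavior.

```python
from typing import Callable, Dict, List


def escalate(detection: str,
             sinks: Dict[str, Callable[[str], None]],
             log: List[str]) -> List[str]:
    """Send a detection to every configured sink; always write a log entry."""
    delivered = []
    for name, send in sinks.items():
        try:
            send(detection)
            delivered.append(name)
        except Exception as exc:
            # Assumption: one failing sink must not block the others.
            log.append(f"escalation to {name} failed: {exc}")
    # The log entry is written whether or not any sink is configured.
    log.append(f"detection: {detection} (sent to: {', '.join(delivered) or 'log only'})")
    return delivered


log: List[str] = []
received: List[str] = []
sinks = {"sentinel": received.append, "webhook": received.append}
print(escalate("support-agent-prod: stall detected", sinks, log))  # ['sentinel', 'webhook']
print(escalate("indexer-agent: missed heartbeat", {}, log))        # []
print(len(log))                                                    # 2
```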

Pricing

Watchdog is a paid product. It’s designed for teams running production agents where missed heartbeats and silent stalls have real consequences. See Pricing for current rates.

Next Steps