Watchdog

Watchdog continuously verifies that your agents are not just alive but actually making progress. It monitors gateway and agent liveness, probes for stall signatures, and escalates before a silent throughput collapse becomes an outage.

The Gap Between “Running” and “Working”

Process monitors and uptime checkers answer one question: is the process alive? That’s necessary but insufficient for agents. An agent can be:

Running but stalled — process is up, no work happening
Running but looping — executing continuously, not advancing
Running but silent — heartbeat looks fine, outputs have stopped

Watchdog watches for these conditions: the gap between a process being alive and an agent actually working.

What Watchdog Does

Gateway Probes

Probes gateway health actively at regular intervals, so liveness is verified rather than assumed.

Stall Signature Detection

Recognizes patterns consistent with execution stalls before they cause failures.

Throughput Monitoring

Tracks whether work is actually completing, not just whether the agent is active.

Escalation Pipeline

Routes detections to Sentinel, Agent911, and configured alert channels.

How Watchdog Works

Watchdog runs as a persistent background monitor with two probe types:

Liveness Probes

Active checks that the gateway and agent runtime are responsive:

Every 30s:
  → Probe gateway endpoint
  → Verify response within SLA window (default: 5s)
  → Record heartbeat age
  → Escalate if missed for N consecutive probes
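
The liveness loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Watchdog's actual implementation: the `LivenessTracker` class and `http_probe` function are hypothetical names, and treating any 2xx HTTP response inside the timeout as "alive" is an assumption about what a gateway probe checks.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class LivenessTracker:
    """Hypothetical helper: counts consecutive missed probes and tracks heartbeat age."""
    missed_threshold: int = 3                 # N consecutive misses before escalating
    consecutive_misses: int = 0
    last_heartbeat: Optional[float] = None    # timestamp of the last successful probe

    def record(self, ok: bool, now: float) -> bool:
        """Record one probe result; return True when escalation is due."""
        if ok:
            # A successful probe resets the miss counter and refreshes the heartbeat.
            self.consecutive_misses = 0
            self.last_heartbeat = now
            return False
        self.consecutive_misses += 1
        return self.consecutive_misses >= self.missed_threshold

    def heartbeat_age(self, now: float) -> Optional[float]:
        """Seconds since the last successful probe, or None if there has been none."""
        if self.last_heartbeat is None:
            return None
        return now - self.last_heartbeat


def http_probe(url: str, timeout_seconds: float = 5.0) -> bool:
    """Assumed probe: a 2xx response that arrives inside the SLA window counts as alive."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as resp:
            ok = 200 <= resp.status < 300
    except OSError:
        # Covers connection errors, HTTP error statuses, and socket timeouts.
        return False
    return ok and (time.monotonic() - started) <= timeout_seconds
```

A caller would invoke `http_probe` every 30 seconds and pass the result to `record`. With `missed_threshold=3`, the third consecutive failure is the first one that returns `True`.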

Progress Probes

Passive observation that meaningful work is completing:

Continuously:
  → Monitor context update frequency
  → Track tool call completion rate
  → Watch for repeated-output patterns (loop detection)
  → Measure throughput vs. expected baseline
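
As a rough sketch of how two of these checks could fit together, the Python below tracks repeated outputs (loop detection) and time since the last new output (stall detection). The `ProgressTracker` class is hypothetical, and defining "progress" as "an output that differs from the previous one" is a simplifying assumption; Watchdog's real signals (context updates, tool call completions) are not modeled here.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class ProgressTracker:
    """Hypothetical tracker for loop and stall conditions."""
    stall_threshold_seconds: float = 300.0   # no progress for this long means stall
    loop_threshold: int = 5                  # N identical outputs in a row means loop
    last_progress_at: float = 0.0
    recent_outputs: Deque[str] = field(init=False)

    def __post_init__(self) -> None:
        # Keep only the last N outputs; older ones cannot affect loop detection.
        self.recent_outputs = deque(maxlen=self.loop_threshold)

    def observe_output(self, output: str, now: float) -> None:
        """Record an agent output; only a changed output counts as progress."""
        if not self.recent_outputs or self.recent_outputs[-1] != output:
            self.last_progress_at = now
        self.recent_outputs.append(output)

    def check(self, now: float) -> List[str]:
        """Return the detections that currently apply."""
        detections = []
        window_full = len(self.recent_outputs) == self.loop_threshold
        if window_full and len(set(self.recent_outputs)) == 1:
            detections.append("loop")
        if now - self.last_progress_at >= self.stall_threshold_seconds:
            detections.append("stall")
        return detections
```

Because repeated outputs do not refresh `last_progress_at`, an agent that loops long enough is eventually reported as stalled as well, which matches the "running but looping" case described earlier.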

Detection Scope

Detection	Trigger
Missed heartbeat	No liveness response for N consecutive intervals (N is probes.liveness.missed_threshold)
Stall	No meaningful progress for the threshold window (probes.progress.stall_threshold_seconds)
Gateway unresponsive	Probe timeout with no response
Silent throughput collapse	Outputs dropped to zero without process failure
Loop detected	Identical output repeated N times in succession (N is probes.progress.loop_threshold)
Abnormal latency	Response times exceed established baseline by threshold
Probe SLA breach	Gateway responded but took longer than configured SLA
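
The three probe-level rows in this table (gateway unresponsive, probe SLA breach, abnormal latency) differ only in how a single probe result is interpreted. The sketch below shows one way to separate them. The function name `classify_probe`, the `baseline_seconds` value, and the multiplicative `latency_factor` are all assumptions for illustration; the table does not specify how "exceeds baseline by threshold" is computed.

```python
from typing import Optional


def classify_probe(
    latency_seconds: Optional[float],   # None means the probe timed out with no response
    sla_seconds: float = 5.0,           # configured SLA window
    baseline_seconds: float = 0.5,      # assumed established latency baseline
    latency_factor: float = 3.0,        # assumed threshold: multiples of the baseline
) -> Optional[str]:
    """Map one probe result to a detection name, or None if the probe looks healthy."""
    if latency_seconds is None:
        return "Gateway unresponsive"
    if latency_seconds > sla_seconds:
        # The gateway answered, but too late to satisfy the SLA.
        return "Probe SLA breach"
    if latency_seconds > baseline_seconds * latency_factor:
        # Within SLA, but much slower than usual.
        return "Abnormal latency"
    return None
```

The checks are ordered from most to least severe, so a single probe yields at most one detection.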

Quick Start

# Start Watchdog
acme watchdog start

# Check status
acme watchdog status

# View recent detections
acme watchdog log --tail 20

Sample Output

Watchdog Monitor
================
Status: ACTIVE
Agents watched: 3

Agent Health:
  ✓ pipeline-agent       ALIVE    Heartbeat: 8s ago
  ✓ indexer-agent        ALIVE    Heartbeat: 12s ago
  ⚠ support-agent-prod   STALL    No progress: 4m 22s

Recent Detections:
  [HIGH] 15:42:09 — support-agent-prod: Stall detected (no progress 4m 22s)
  [HIGH] 15:42:09 — support-agent-prod: Feeding Sentinel + Agent911

Escalation:
  → Sentinel: notified
  → Agent911: snapshot triggered
  → Webhook: alert sent

Configuration

# ~/.acme/watchdog.yaml
probes:
  liveness:
    interval_seconds: 30
    timeout_seconds: 5
    missed_threshold: 3    # Escalate after N missed probes

  progress:
    stall_threshold_seconds: 300   # Escalate if no progress for this long
    loop_threshold: 5              # Escalate after N repeated outputs
    silence_threshold_seconds: 120 # Escalate if silent for this long

alerts:
  channels:
    - type: webhook
      url: https://your-alerting-system.com/webhook
    - type: email
      to: oncall@yourcompany.com

# Integration with ACME stack
integrations:
  sentinel: true
  agent911: true
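
When tuning the liveness values, it helps to estimate how long a dead gateway can go unnoticed. The sketch below works through that arithmetic under two assumptions that this page does not confirm: probes start on a fixed schedule, and each failed probe takes the full timeout to be declared missed. The function name is hypothetical.

```python
from typing import Tuple


def liveness_escalation_window(
    interval_seconds: float,
    timeout_seconds: float,
    missed_threshold: int,
) -> Tuple[float, float]:
    """Return (best, worst) seconds from gateway failure to escalation."""
    # Best case: the gateway fails just before a probe starts, so that probe is
    # the first miss and only (N - 1) further intervals are needed.
    best = (missed_threshold - 1) * interval_seconds + timeout_seconds
    # Worst case: the gateway fails just after a probe succeeds, so a full
    # interval passes before the first miss.
    worst = missed_threshold * interval_seconds + timeout_seconds
    return best, worst


best, worst = liveness_escalation_window(30, 5, 3)
print(f"escalation after {best}s to {worst}s")
```

Under these assumptions, the sample values above (30s interval, 5s timeout, threshold of 3) give an escalation window of roughly 65 to 95 seconds. Lowering `interval_seconds` shrinks that window at the cost of more probe traffic.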

CLI Reference

# Start/stop/restart
acme watchdog start
acme watchdog stop
acme watchdog restart

# Status and health
acme watchdog status
acme watchdog status --format json
acme watchdog health --agent <name>

# Logs and detections
acme watchdog log
acme watchdog log --tail 50 --follow
acme watchdog log --since 1h --format json

# Probe management
acme watchdog probe run --agent <name>    # Manual probe
acme watchdog probe history               # Probe history

# Configuration
acme watchdog config show
acme watchdog config set probes.liveness.interval_seconds 15

Watchdog vs Sentinel

Watchdog and Sentinel are complementary:

Tool	What It Watches	How
Watchdog	Gateway liveness, heartbeat, throughput	Active probes + passive monitoring
Sentinel	Runtime anomalies, stalls, behavioral patterns	Continuous passive observation

Watchdog is the active verifier. Sentinel is the passive observer. Together they cover failures that either one alone would miss. When both are configured, they share detections with Agent911, and correlated events surface as higher-confidence incidents.

Escalation Path

When Watchdog detects a problem:

Watchdog detects stall
        │
        ├──▶ Sentinel notified (if configured)
        ├──▶ Agent911 snapshot triggered (if configured)
        ├──▶ Webhook alert sent (if configured)
        └──▶ Watchdog log entry (always)
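
The fan-out in this diagram can be sketched as a small dispatcher: each configured target gets the detection, and the log entry is written regardless. This is an illustrative sketch only. The `escalate` function, the handler signature, and the log format are hypothetical, and how Watchdog behaves when one target fails is an assumption here (the sketch records the failure and continues).

```python
from typing import Callable, Dict, List

Handler = Callable[[str], None]   # assumed shape: a target accepts a detection string


def escalate(detection: str, handlers: Dict[str, Handler], log: List[str]) -> List[str]:
    """Send a detection to every configured target; always write a log entry.

    Returns the names of the targets that were notified successfully.
    """
    notified: List[str] = []
    for name, handler in handlers.items():
        try:
            handler(detection)
            notified.append(name)
        except Exception as exc:
            # One failing target must not block the others.
            log.append(f"escalation to {name} failed: {exc}")
    # The log entry is unconditional, mirroring the "(always)" branch above.
    log.append(f"detection: {detection} (notified: {', '.join(notified) or 'none'})")
    return notified
```

Only configured targets appear in `handlers`, so an unconfigured Sentinel, Agent911, or webhook is simply absent from the dictionary.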

Operators respond through Agent911’s guided playbook. Watchdog surfaces the problem and hands it off.

Pricing

Watchdog is a paid product. It’s designed for teams running production agents where missed heartbeats and silent stalls have real consequences. See Pricing for current rates.

Next Steps

Sentinel

Passive runtime anomaly detection. Pairs with Watchdog for full coverage.

Agent911

Watchdog feeds detections into the Agent911 control plane.

DriftGuard

Long-horizon behavioral drift. Watchdog handles the short horizon.

Lazarus

Confirm recovery readiness so Watchdog escalations have a clear path.
