LLM idle watchdog: escalating silent model streams through profile rotation and model fallback

A silent model stream is one of the worst failure modes a self-hosted agent can hit. The provider accepts the request, the connection stays open, no error fires, and no tokens arrive. Without a watchdog, the agent turn sits there forever, holding a session slot, while the user stares at a blinking cursor.

The 2026.5.12 release closed this gap with a small but load-bearing fix. An LLM idle watchdog that now escalates timeouts through profile rotation and configured model fallback, instead of leaving the turn stuck. This post explains what was actually broken, what the fix changes, and how to set up the surrounding config so the watchdog has somewhere useful to escalate to.

What an LLM idle watchdog actually watches

An idle watchdog is not a request timeout. A request timeout caps the total time a single HTTP call can take. An idle watchdog measures time since the last byte arrived on a streaming response and trips when that gap exceeds a threshold.

For a streaming chat completion, the two failure shapes look different:

Failure modeWhat the request timeout catchesWhat the idle watchdog catches
Provider returns 5xx after 30sYes (via HTTP error)N/A; connection closed cleanly
Connection refused / DNS failsYes (immediate)N/A; no stream opened
Stream opens, then provider stalls mid-responseOnly if total deadline exceededYes (no bytes for N seconds)
Provider rate-limits silently, holds connection openNo; connection is “healthy”Yes (no bytes for N seconds)

The last two rows are why an idle watchdog matters. A 60-second request deadline does not help an agent that needs to emit 8,000 tokens of output. You cannot set the deadline tight enough to catch a stall without breaking legitimate long generations. Idle-gap detection is the only way to separate “still working” from “dead but pretending.”

What was broken before 2026.5.12

The bug report (#76877) describes the user-visible symptom: agent turns going silent after a model stream stalled, with no recovery and no escalation. The watchdog itself existed and could detect the idle gap, but when it tripped, the turn ended. The agent did not try the next auth profile, did not fall back to a configured backup model, and did not surface the failure as a structured retryable error. From the user’s seat, the agent simply stopped answering.

This is a worse failure than a hard error. A hard error is observable: logs show it, retries can fire, on-call gets paged. A silent stall produces no signal at all. For self-hosted operators, the only mitigation was an external supervisor restarting the agent process on a coarse heartbeat, which kills unrelated in-flight turns as collateral damage.

What the 2026.5.12 fix actually does

The release note is one sentence:

Agents: escalate LLM idle watchdog timeouts through profile rotation and configured model fallback instead of leaving agent turns stuck after a silent model stream. Fixes #76877. (#80449)

Unpacked, the new behavior runs in three steps when the watchdog trips:

  1. Profile rotation. The current auth profile is marked unhealthy for this turn. The runtime selects the next configured auth profile for the same provider, if you have multiple, and retries from a clean stream. Splitting rate limits across keys is the common reason to have more than one.
  2. Configured model fallback. If no further profiles are available, the runtime walks the configured model fallback chain (the same chain used by acp.fallbacks and the broader retry layer) and reissues the turn against the next backend.
  3. Structured failure. If neither escalation produces output, the turn ends with a typed error that downstream code (cron, ACP, channel adapters) can classify and retry, instead of an opaque silent hang.

The escalation is per-turn and stateless across turns. A profile that stalled once is not blacklisted globally. The runtime just avoids it for the current escalation chain.

How to configure the chain so the fix has somewhere to go

The fix is only useful if your config gives the watchdog targets to escalate to. A self-hosted install with a single API key and no fallback model has nothing to rotate to, and the fix degrades to the old behavior with a typed error at the end. To get real recovery, configure both layers.

Multiple auth profiles per provider

The auth profile system stores credentials in auth-profiles.json keyed by provider. To enable rotation, add more than one profile per provider you care about:

 auth login anthropic --profile primary
 auth login anthropic --profile backup

The watchdog will rotate primary to backup before giving up on the provider. If both share a single quota pool (same org account), rotation buys you nothing. The value is splitting across separately rate-limited keys or accounts.

Model fallback chain

The acp.fallbacks config that shipped in the same release wires up cross-backend escalation. A minimal example in .json:

{
  "acp": {
    "fallbacks": [
      { "agentRuntime": "claude-cli", "model": "claude-opus-4-7" },
      { "agentRuntime": "codex", "model": "gpt-5.3-codex" }
    ]
  }
}

The idle watchdog reuses this chain. If the primary backend stalls and no auth profile rotation recovers it, the next entry in acp.fallbacks is tried for the same turn. Pair this with a different vendor for the backup (Anthropic primary, OpenAI backup, or vice versa) so a provider-wide incident does not take both out at once.

Idle threshold

The watchdog threshold is conservative by default. Long enough that legitimate slow generations (long reasoning chains, big tool-call payloads, large refactors) are not interrupted. Operators who run agents on faster models and want tighter recovery can lower it; operators running deep-reasoning models should leave it alone. There is no single right number. The trade-off is “how long am I willing to wait for a real generation” versus “how long am I willing to sit on a stalled one.” Start with the default, watch your turn-failure logs for a week, and tune only if you see real stalls.

When the watchdog will not save you

The fix is targeted: it handles silent streams. It does not help with:

  • Bad prompts that produce empty completions. The model returned successfully with no content. That is a different bug, usually upstream in the agent’s prompt construction.
  • Stuck tool calls. If an agent is waiting on a tool that hangs (a shell command, an HTTP fetch, an MCP server that does not respond), the model stream is not the bottleneck. Tool timeouts are a separate layer.
  • Total provider outages with no backup configured. Rotation needs a target. If every fallback entry points at the same dead vendor, the typed error is the best you can get.
  • Long, intentional silences during reasoning. Some models emit nothing during extended internal reasoning before producing the final answer. If the silence exceeds your idle threshold, you will get false-positive escalations. Tune the threshold, or pin those agents to providers that emit periodic keep-alive tokens.

The honest framing: an idle watchdog is a backstop, not a substitute for picking reliable providers and writing prompts that produce output.

How this fits with the rest of the 2026.5.12 reliability pass

The 2026.5.12 release is heavy on reliability fixes that share a theme: don’t let one stuck component silently end a session. The watchdog escalation joins:

  • Telegram polling resilience. Ingress moved to an isolated worker with a durable local spool so polling survives main-loop stalls (#81746).
  • acp.fallbacks. ACP turns try backup runtime backends before any output is emitted (#69542).
  • Stale OAuth lock recovery. Auth reclaims dead-owner file locks before retrying locked writes, so crashed OAuth refreshes no longer wedge auth-profiles.json.
  • Gateway cold-start liveness. Startup grace window suppresses spurious liveness warnings while still sampling metrics (#81699).

Read together, the pattern is: every layer that can stall (channel ingress, model stream, auth refresh, gateway startup) now has either a watchdog with somewhere to escalate, or a recovery path that does not require human intervention. For self-hosted operators running agents unattended, this is the single most useful kind of change a release can ship. It reduces the rate at which you have to wake up.

FAQ

Does the LLM idle watchdog count tool-call streaming as activity? Yes. Any bytes received on the stream (content tokens, tool-call argument tokens, or framing) reset the idle clock. The watchdog only trips on a true silent gap.

What happens to partial output already emitted before the stall? The escalation reissues the turn from scratch against the next backend. Partial output from the stalled stream is discarded rather than stitched together, because mid-stream switching produces incoherent results.

Can I disable escalation and keep the old behavior? Not directly. The escalation runs whenever profiles or fallbacks are configured. To approximate the old behavior, leave both unconfigured. The watchdog will detect the stall, fail to find any escalation target, and end the turn with a typed error.

Does this work for non-ACP turns too? The fix is described as applying to “agent turns” broadly, with ACP fallbacks as the cross-backend escalation chain. In practice the watchdog runs at the runtime layer, so non-ACP harnesses (Codex, Claude CLI) get the same profile rotation behavior on stalls.

How do I know the watchdog actually fired? Watchdog escalations are logged with a structured event including the original profile, the escalation target, and the elapsed idle time. Filter your logs for the relevant event name to count occurrences per day. A sudden spike usually means a specific provider is degrading.

Putting an LLM idle watchdog to work

A watchdog with nowhere to escalate is just a more polite way to fail. The 2026.5.12 fix is real, but it only pays off if you configure at least one rotation target. Ideally a second auth profile and a different-vendor fallback model. For self-hosted operators of as a self-hosted AI assistant, that combination is what turns a silent stall from “agent dies, user notices” into “agent recovers, log line written.” The same release ships ACP backend failover as the cross-backend half of this story; together they cover most of the practical agent reliability gap that single-provider self-hosting hits in production.

Sources: