Anthropic extended thinking needs session recovery, not manual transcript surgery

Anthropic extended thinking is useful for research agents, coding agents, and other long-running workflows, but it also changes the shape of a session transcript. Thinking blocks, signatures, prompt-cache state, streaming events, and tool calls now have to survive restarts and idle gaps. If one of those pieces goes stale, the right answer is recovery. The wrong answer is asking the user to delete the transcript and lose the work.

Table of contents:

Why Anthropic extended thinking changes the failure model

Anthropic extended thinking lets Claude create thinking content blocks before it returns final text blocks. Anthropic’s API examples show those thinking blocks carrying a signature, and AWS Bedrock’s Claude documentation tells builders to be ready to handle both thinking and text blocks when streaming.

That is a different contract from a plain chat stream. A normal assistant turn can often be replayed as text plus tool calls. An extended thinking turn carries provider-specific bookkeeping. The transcript is no longer just conversation memory; part of it is a signed artifact that the provider may validate on the next turn.

That matters most for agents that run for hours. A user can start a research session, call tools, leave the agent idle, restart a local Gateway, and come back expecting the session to continue. In that path, the hard problem is not generating better reasoning. It is keeping the session recoverable when the provider rejects stale thinking state.

Agent layerWhat can go staleFailure symptom
Prompt cacheCached prefix expires after the default 5-minute lifetimeReplay no longer matches the provider’s cached state
TranscriptOld thinking blocks are replayed with invalid signaturesProvider rejects the request before tokens arrive
Streaming adapterThe runtime marks a stream as started too earlyRecovery code never sees the original provider error cleanly
Tool loopTool calls make the transcript longer and more structuredManual cleanup becomes risky and lossy

The table is the reason this is a runtime issue, not a prompt-writing issue. Better instructions cannot repair a signed block that the provider rejects.

The failure pattern: cache expiry, restart, invalid signature

OpenClaw issue #90667 captured the concrete bug behind the fix. A claude-sonnet-4-6 agent with thinkingLevel: high or thinking: adaptive could run a multi-turn session with tool calls, then become permanently broken after a Gateway restart or Anthropic prompt-cache expiry.

The user-visible error was generic: LLM request failed: provider rejected the request schema or tool payload. The Gateway log carried the real clue: Invalid signature in thinking block. The report also noted the painful operational detail. Every later message failed in roughly 300 to 400 milliseconds with zero tokens consumed, and manual recovery meant deleting the session .jsonl transcript.

That last part is what makes the bug expensive. A research agent may have spent an hour collecting sources, narrowing a claim, and deciding what to ignore. If the only recovery path is transcript deletion, the user loses context even though the final answer may be only one turn away.

Anthropic’s prompt caching docs explain one piece of the trigger. By default, cache entries have a 5-minute lifetime, refreshed when cached content is reused. AWS also notes that changes to the thinking budget invalidate cached prompt prefixes that include messages, while cached system prompts and tool definitions continue to work. In other words, extended thinking sessions have legitimate cache and replay boundaries. Agent runtimes need to treat those boundaries as expected events.

What OpenClaw changed in 2026.6.5-beta.2

The 2026.6.5-beta.2 release note says Anthropic extended-thinking sessions now recover after prompt-cache expiry or Gateway restart because stream start events wait for message_start, which lets pre-generation signature errors trigger the existing recovery retry.

That sounds small. It is not. The fix moves recovery to the correct point in the stream lifecycle.

Before the fix, a runtime could treat the stream as started before the provider had actually emitted message_start. If Anthropic rejected the replayed history before generation, the failure landed in the wrong state. The recovery helper existed, but the direct Anthropic path did not reliably call it for this class of error. The agent looked alive from the outside and dead from the user’s seat.

After the fix, the runtime waits for the provider’s actual start signal. If the provider rejects a stale thinking signature before generation, the recovery path can strip or repair the unsafe thinking state and retry instead of poisoning the session forever.

For OpenClaw users, the product lesson is simple: agent runtime reliability depends on the boring edges between providers, transcripts, and streams. Features like context-window debugging and the earlier LLM idle watchdog solve adjacent problems, but extended thinking adds a provider-signature edge that needs its own recovery rule.

A recovery checklist for extended thinking agents

If you are building or operating an Anthropic extended thinking agent, design for these checks before you depend on it for long sessions.

  1. Wait for the provider’s real start event. Do not mark a stream as safely underway until the provider emits the event that means generation has begun. For Anthropic message streams, message_start is the boundary that matters here.
  2. Classify pre-generation provider errors separately. A schema rejection before tokens arrive should not be handled like a mid-stream network interruption. It is a replay or payload problem until proven otherwise.
  3. Treat thinking blocks as provider state, not normal memory. The user may benefit from summaries or decisions, but the raw signed block is not the durable memory primitive you want to rely on across restarts.
  4. Keep a repair path that preserves useful context. Strip stale thinking artifacts, keep user messages, tool results, citations, and final assistant text where safe, then retry once through the normal provider path.
  5. Log the recovery reason. Operators need to know whether a retry came from cache expiry, invalid signature, restart replay, or a generic provider 400. Otherwise every failure collapses into “Claude broke.”
  6. Test idle sessions with tools. A single-turn math prompt will not reveal this class of bug. Use a multi-turn session with web search or file tools, restart the Gateway, wait beyond the cache window, then send another message.

This is also where audit logs for AI agents become useful. You do not need to store raw private reasoning forever. You do need enough structured evidence to explain why the runtime stripped a block, retried, and kept the session alive.

What to store instead of raw thinking

A durable agent transcript should preserve what the next turn needs, not every internal token the model produced. For extended thinking sessions, that usually means storing:

  • user messages and final assistant answers;
  • tool calls, tool results, and source URLs;
  • model, provider, and thinking configuration;
  • recovery markers such as invalid_thinking_signature_retried;
  • short summaries of decisions that matter for future turns.

It usually does not mean replaying every raw thinking block forever. Anthropic and AWS both document extended thinking as a special response form with its own streaming, budget, and cache behavior. Treating those blocks like plain markdown is how a local transcript becomes a broken replay payload.

OpenClaw’s broader direction is to make this kind of operational boundary visible: why OpenClaw goes beyond “run an agent locally.” It is owning the parts of the agent stack that decide whether a long-running session can recover after the provider, Gateway, network, or channel layer changes underneath it.

FAQ

What is Anthropic extended thinking?

Anthropic extended thinking is a Claude API mode that lets the model spend a configured token budget on internal reasoning before returning a final answer. The API can return thinking blocks and final text blocks, so clients need to handle both.

Why can extended thinking sessions fail after cache expiry?

They can fail when replayed session history contains thinking blocks or signatures that no longer match the provider’s expected state. Anthropic prompt caching has a default 5-minute lifetime, and extended thinking adds signed provider artifacts to the transcript.

Should an agent store raw thinking blocks?

Store them only if you have a clear reason and a safe replay policy. For most durable memory, final answers, tool results, citations, model settings, and decision summaries are safer than raw thinking blocks.

Does this mean extended thinking is unsafe to use?

No. It means extended thinking needs runtime support. The feature is powerful for complex tasks, but long-running agents need cache-aware replay, stream-start boundaries, and transcript repair.

Sources: OpenClaw v2026.6.5-beta.2 release notes, OpenClaw issue #90667: extended thinking session recovery, Anthropic docs: building with extended thinking, Anthropic docs: prompt caching, AWS Bedrock docs: Claude extended thinking