> For the complete documentation index, see [llms.txt](https://docs.interactive.ai/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://docs.interactive.ai/agents/operations/troubleshooting.md).

# Troubleshooting

> **Context** — Indexed by symptom. Each entry: what you see → what it means → what to do. Diagnosis tooling (traces, logs, markers) is covered in [Observability](/agents/guides/observability.md).

## Boot & readiness

### The container exits immediately

Read the last log lines — startup failures are loud and specific, and the process exits non-zero within 10 seconds by design:

| Log says                                                           | Cause                                                                                                                            | Fix                                                                                       |
| ------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| Validation error listing manifest fields                           | Structural manifest problem (bad hostname pattern, port range, missing required field, literal where a `${VAR}` ref is required) | Fix every listed violation; the schema reports them all at once                           |
| `env var X is unset` (or a `ValueError` naming a variable)         | A `${VAR}` ref has no backing env var                                                                                            | Add the variable to the injected secrets — check the manifest's `secrets:` list covers it |
| Router key missing/empty                                           | `ROUTER_API_KEY` (or your chosen name) unset or blank                                                                            | Set it; the agent refuses to start rather than send unauthenticated model traffic         |
| Content reference not found (a routine/policy/prompt id + version) | The pinned version doesn't exist in the platform catalog                                                                         | Publish that version to the catalog, or fix the `id`/`version` pin in the manifest        |
| MCP server unreachable                                             | A declared `mcps[]` entry can't be contacted                                                                                     | Fix the address/network, or remove the entry — tool servers are part of the boot contract |
| Embedding dimension mismatch                                       | `search.embedding_dimensions` ≠ the pgvector column                                                                              | Align the manifest with the collection (or re-index)                                      |
| Webhook references unknown routine                                 | `webhooks[].routines` names a routine that isn't autonomous or isn't referenced                                                  | Fix the cross-reference                                                                   |

### `/health/ready` stays 503

Readiness flips only after content fetch + configuration apply. Stuck means one of those is hanging — usually platform connectivity or a slow MCP handshake. The log shows the last completed stage. Note that routine evaluation happens **after** readiness — a ready-but-slow-to-respond agent on first turns is the evaluation cache warming, not a readiness problem (see below).

### Boot succeeds but takes minutes; first turns are slow

Cold evaluation cache. Verify with the per-routine bookend lines (`Routine '<title>' evaluated: N nodes in Xs` — nonzero nodes = cache miss). Fix permanently by [startup evaluation](/agents/concepts/startup-evaluation.md#caching-cold-vs-warm-boots); mitigate by raising `EVAL_NODE_PARALLELISM` if your router budget allows.

## Wrong behaviour

### A policy doesn't fire when it should

1. Check the trace: the matching decision is recorded per policy per turn.
2. Usual causes, in order of frequency:
   * **Abstract condition** — rewrite with concrete triggers ([guide](/agents/guides/authoring-policies.md#writing-conditions-that-match-correctly)).
   * **Condition depends on data not yet in the conversation** (e.g. on a tool result no step has fetched).
   * **Stale state** — relevance flipped mid-turn after a tool ran; add the tool to `reevaluate_after` or `context.reevaluation_tools`.
   * **Lost a conflict** — a higher-priority policy/routine took precedence; check [priorities](/agents/concepts/priorities.md).
3. For must-always-hold rules, `always_match: true` removes the matcher from the equation.

### A routine activates at the wrong time (or never)

Activation is condition matching — same diagnosis as policies. The classic miss: two routines with overlapping conditions where the wrong one wins. Carve mutual boundaries into both conditions ("Do NOT activate when the user refers to an existing booking") and/or add an explicit [priority](/agents/concepts/priorities.md).

### The agent calls a tool and says nothing

A node combines `tools` with `chat_state` — the schema rejects the combination, and older content that slipped through behaves as a tool node with the chat text ignored. Split into two nodes connected by a transition. This is the #1 authoring bug: [the node types](/agents/concepts/routines.md#node-types-read-this-first).

### The agent ignores routine structure / freelances the flow

* The system prompt narrates a competing procedure — move procedure out of the prompt, keep persona ([guide](/agents/concepts/prompts.md#division-of-labour)).
* Mega-nodes invite improvisation — one concern per node ([anti-patterns](/agents/guides/authoring-routines.md#anti-patterns)).

### Answers lost their grounding (KB seems ignored)

Retrieval **soft-fails by design** — turns proceed without context. Check logs for retrieval warnings; check the trace's retrieval span (is the rewritten query sane? did snippets come back?). Common causes: KB down/unreachable, wrong `content_key` (rows found, empty bodies), filter excluding everything, external endpoint returning a non-bare-array shape. See [knowledge-base verification](/agents/guides/knowledge-base-setup.md#verifying-grounding-both-options).

### Replies in the wrong language

Check `context.language`. `match_user` mirrors the customer; a fixed language overrides them; a list prefers the customer's if listed. The greeting/preamble adapt to the same directive.

## Tools

### Tool calls fail at runtime

The failed call is visible in the trace (arguments + error) and in the event stream. Distinguish:

* **Transport failures** — the runtime reconnects to MCP servers automatically; sustained failures mean the server is down.
* **Tool-level errors** — your server returned an error result; the model reads it and follows its instructions. If it handles errors badly, the tool's error payload probably isn't self-explanatory — [return errors as data](/agents/guides/connecting-tools.md#3-designing-tools-the-agent-calls-well).
* **Wrong arguments** — improve the parameter docs and the node's `tool_instruction`; the model only knows what those say.

## Autonomous runs

Diagnose by the callback's `error.code` (full taxonomy: [Autonomous routines](/agents/concepts/autonomous-routines.md#failure-taxonomy)):

| Code                            | First check                                                                                                                                                                                              |
| ------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `timeout`                       | Trace shows where time went — slow tool? model latency? Raise `timeout_seconds` only after the trace says the path is legitimately long.                                                                 |
| `max_engine_iterations_reached` | Count the longest path's tool+think nodes (+1 for `emit_output`) against `runtime.max_engine_iterations`.                                                                                                |
| `output_validation_failed`      | Error details carry the JSON path; the terminal `tool_instruction` and `output_schema` disagree.                                                                                                         |
| `engine_error`                  | Open the trace via the callback's `trace_id`; check `[retry-fallback]` ERRORs around the timestamp.                                                                                                      |
| HTTP 400 at trigger             | Input vs `input_schema` — the response body names the path.                                                                                                                                              |
| HTTP 404 at trigger             | Routine id wrong, or the routine YAML has no `autonomous:` block.                                                                                                                                        |
| No callback arrives             | Check `callback_url_allowlist`; check your receiver returned 2xx (the server retries 5× with backoff, then gives up — delivery status is logged).                                                        |
| Webhook returns 401             | Signature mismatch: algorithm, prefix, header name, or secret value differ from the provider's signing config. The secret is re-read per request, so rotation is instant — make sure both sides rotated. |

## Performance

| Symptom                    | Likely cause                                                          | Lever                                                                                  |
| -------------------------- | --------------------------------------------------------------------- | -------------------------------------------------------------------------------------- |
| Slow replies, always       | Long tool chains per turn; chatty MCP servers; oversized tool results | Trace the turn; trim results; parallelise independent tools in one node                |
| Slow replies, occasionally | Evaluation-model retries/escalations                                  | `[retry-fallback]` rate; see [Models](/agents/concepts/models.md#operational-guidance) |
| Router throttling          | Scale-out multiplied call volume                                      | Router-side limits; reduce `EVAL_NODE_PARALLELISM` during boots                        |
| Polling clients hammered   | Busy-loop event polling                                               | The server emits `x-polling-backoff`; the SDK honours it — custom clients should too   |

## Sessions & state

| Symptom                                              | Cause                                                           | Fix                                                                                                   |
| ---------------------------------------------------- | --------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- |
| Conversations reset after deploys                    | In-memory storage                                               | Add the `database:` block — [storage backends](/agents/concepts/memory-and-state.md#storage-backends) |
| Agent doesn't know what the integration just learned | Wrote metadata, not variables; or expected same-turn visibility | Write **variables**; they surface next turn                                                           |
| Replies land on the wrong replica's clients          | In-memory storage + multiple replicas                           | Postgres storage, or webhook event delivery (the URL travels on session metadata)                     |

## When you escalate

Capture: the trace id (or `session_key`/run id), the agent's runtime version (boot banner), manifest content versions (in the trace snapshot), and the relevant log excerpt. That tuple reproduces almost everything.


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.interactive.ai/agents/operations/troubleshooting.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
