# Observability

> **Context** — Every turn an agent takes is traced (OpenTelemetry) and logged (structured JSON). This guide shows how to read both, how to make traces searchable by *your* identifiers, and which signals deserve alerts.
>
> YAML examples follow **manifest schema 6.1.1**. Manifest and content shapes are schema-versioned and differ across runtime versions — see [Versioning & compatibility](/agents/operations/versioning.md).

## Traces

The agent exports OpenTelemetry traces over OTLP/HTTP. By default they go to the InteractiveAI platform's traces backend (derived from `interactive_platform.base_url`, authenticated with the platform keys) and appear in the platform's Traces view. A custom backend is one manifest block away:

```yaml
agent_config:
  traces:
    deployment_environment: production
    backend:
      url: https://otel.your-provider.com/v1/traces
      api_key: ${OTEL_API_KEY}
      api_key_scheme: bearer    # or "basic" for public:secret key pairs
```

`deployment_environment` (default `production`) tags every trace with `deployment.environment` — it's the environment filter in the trace UI, so staging and production agents with the same name stay separable.
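
Before pointing an agent at a custom backend, it can help to check the credentials by hand. The sketch below builds the request header for each `api_key_scheme` value. It assumes the two schemes map onto standard HTTP authentication (`Bearer <key>` and `Basic base64(public:secret)`); the function name is hypothetical and the runtime's actual header construction is not documented here.

```python
import base64


def authorization_header(api_key: str, scheme: str = "bearer") -> dict[str, str]:
    """Build a standard HTTP Authorization header for the given scheme.

    Assumption: "bearer" follows RFC 6750 and "basic" follows RFC 7617 with
    the key supplied as "public:secret". The runtime may behave differently.
    """
    if scheme == "bearer":
        return {"Authorization": f"Bearer {api_key}"}
    if scheme == "basic":
        token = base64.b64encode(api_key.encode("utf-8")).decode("ascii")
        return {"Authorization": f"Basic {token}"}
    raise ValueError(f"unsupported api_key_scheme: {scheme!r}")
```

The resulting dictionary can be passed as headers to any HTTP client when sending a test request to the OTLP endpoint.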

### What a trace contains

One trace per turn (conversational) or per run (autonomous). Inside it: policy matching batches, routine evaluation decisions, every model call (chat and evaluation lanes), every tool call with arguments and results, knowledge-base retrievals with the rewritten query, and the final reply. Each turn's trace also carries a metadata snapshot: the session metadata verbatim and the resolved context variables exactly as fed to the model — so "what did the agent know?" is answerable months later.

The trace's **input** is the customer message; its **output** is the turn's reply messages (or the autonomous run's typed output).

### Trace naming

Traces group and name themselves off one resolved **resource id**, in precedence order:

1. **Autonomous runs** — the value of the input field named by `agent_config.traces.trace_id_field`. Set it to your business key:

   ```yaml
   agent_config:
     traces:
       trace_id_field: customer_id
   ```

   A run triggered with `{"input": {"customer_id": "cus_abc", ...}}` then traces as `{agent}-cus_abc` instead of a synthetic run id.
2. **Conversational sessions** — `session.metadata["session_key"]`, an optional key your integration sets when opening the session. Session ids are opaque server-generated hashes; `session_key` is how you attach a stable, human-meaningful identifier (a ticket id, a case number):

   ```python
   sess = await client.sessions.open(
       id="zendesk-42",
       metadata={"session_key": "TICKET-7841"},
   )
   ```

   Every turn of that conversation then groups under `TICKET-7841` in the trace UI's session view, named `{agent}-TICKET-7841`, with the user-id column filterable by the same key. The metadata key must be named exactly `session_key`.
3. **Fallback** — the first 8 characters of the session id, so grouping always works.

The autonomous callback's `trace_id` field links a delivered result back to its trace directly.
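
The precedence above can be restated as a small function, which is handy for predicting what a trace will be called before searching for it. This is an illustrative sketch, not the runtime's code: the function and parameter names are hypothetical, and applying the `{agent}-…` naming pattern to the fallback case is an assumption (the guide only shows it for the first two cases).

```python
from typing import Any, Mapping, Optional


def resolve_resource_id(
    session_id: str,
    session_metadata: Optional[Mapping[str, Any]] = None,
    run_input: Optional[Mapping[str, Any]] = None,
    trace_id_field: Optional[str] = None,
) -> str:
    """Pick the resource id in the documented precedence order."""
    # 1. Autonomous runs: the input field named by trace_id_field.
    if trace_id_field and run_input and run_input.get(trace_id_field):
        return str(run_input[trace_id_field])
    # 2. Conversational sessions: the metadata key named exactly "session_key".
    if session_metadata and session_metadata.get("session_key"):
        return str(session_metadata["session_key"])
    # 3. Fallback: the first 8 characters of the session id.
    return session_id[:8]


def trace_name(agent: str, resource_id: str) -> str:
    """Format the trace name as {agent}-{resource_id}."""
    return f"{agent}-{resource_id}"
```

For example, `trace_name("support", resolve_resource_id("a1b2c3d4e5f6", {"session_key": "TICKET-7841"}))` yields `support-TICKET-7841`.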

## Logs

Everything the process emits — agent, HTTP server, engine — is single-line JSON on stdout:

```json
{"timestamp": "2026-06-04T10:15:02.114Z", "level": "info", "message": "Received event from session_id: sess_abc for routine_id: kyc-decision", "logger": "agent_server.endpoints.autonomous_routines", "request_id": "8f2a…", "session_id": "sess_abc", "routine_id": "kyc-decision", "run_id": "run_9f8e…"}
```

Conventions worth knowing when querying:

| Field                                     | Meaning                                                                                       |
| ----------------------------------------- | --------------------------------------------------------------------------------------------- |
| `timestamp`, `level`, `message`, `logger` | Always present.                                                                               |
| `request_id`, `method`, `path`            | Bound for every HTTP request.                                                                 |
| `session_id`, `trace_id`                  | Bound during engine turns.                                                                    |
| `duration_ms` / `duration_s`              | Timings — milliseconds generally, **seconds** for evaluation-phase logs.                      |
| `phase`                                   | Marks special subsystems: `eval` (boot-time evaluation), `retry-fallback` (model escalation). |

The manifest's `runtime.log_level` (default `INFO`) drives verbosity; `DEBUG` adds per-decision detail. Health-probe and event-polling requests are excluded from access logs by design.
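
Because timings arrive in two units, it is worth normalizing them when post-processing logs. The following is a minimal sketch that parses the JSON lines and converts either timing field to milliseconds. The helper names are hypothetical, and skipping non-JSON lines is a defensive assumption rather than documented behavior.

```python
import json
from typing import Any, Iterable, Iterator, Optional


def parse_log_lines(lines: Iterable[str]) -> Iterator[dict[str, Any]]:
    """Yield one dict per JSON log line, skipping blanks and non-JSON output."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def duration_ms(record: dict[str, Any]) -> Optional[float]:
    """Return the record's timing in milliseconds, whichever field carries it."""
    if "duration_ms" in record:
        return float(record["duration_ms"])
    if "duration_s" in record:
        # Evaluation-phase logs report seconds.
        return float(record["duration_s"]) * 1000.0
    return None
```

Passing `sys.stdin` to `parse_log_lines` lets the script sit at the end of an `iai agents logs <agent>` pipe.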

## Boot-time evaluation logs

Cold-cache routine evaluation is the noisy phase. At `INFO` you get one bookend per routine:

```
Routine 'Book A Car' evaluated: 11 nodes in 154.2s
```

A count of `0 nodes` means the routine was served from cache. At `DEBUG`, per-stage and per-step lines appear, all tagged `phase="eval"` and message-prefixed `[eval]`. Failures log at WARNING/ERROR regardless of the configured level.

```bash
# everything eval-related, one agent, last 30 minutes
iai agents logs <agent> --since 30m | grep '\[eval\]'

# structured: every eval line via the phase field
iai agents logs <agent> --since 30m | jq 'select(.phase == "eval")'
```
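
To chart evaluation times per routine, the `INFO` bookend can be parsed with a regular expression. This sketch assumes every bookend follows exactly the format of the sample line above, including for cached routines; the names `BOOKEND`, `RoutineEval`, and `parse_bookend` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed format, taken from the sample bookend line in this guide.
BOOKEND = re.compile(
    r"Routine '(?P<name>.+)' evaluated: "
    r"(?P<nodes>\d+) nodes in (?P<seconds>\d+(?:\.\d+)?)s"
)


class RoutineEval(NamedTuple):
    name: str
    nodes: int
    seconds: float
    from_cache: bool


def parse_bookend(message: str) -> Optional[RoutineEval]:
    """Extract routine name, node count, and duration from a bookend message."""
    match = BOOKEND.search(message)
    if match is None:
        return None
    nodes = int(match.group("nodes"))
    # Zero evaluated nodes means the routine was served from cache.
    return RoutineEval(match.group("name"), nodes, float(match.group("seconds")), nodes == 0)
```

Applied to the `message` field of each log record, this returns `None` for unrelated lines, so it can be used as a filter as well as a parser.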

For slow boots, see [Startup evaluation](/agents/concepts/startup-evaluation.md#caching-cold-vs-warm-boots).

## Retry-fallback signals

When an evaluation call exhausts its 3 attempts and escalates to `evaluation_fallback` (see [Models](/agents/concepts/models.md)), the runtime logs it with a `[retry-fallback]` message marker and `phase="retry-fallback"`. The log line names the call site that escalated and both models. Kick-in and success log at WARNING; exhaustion of both models logs at ERROR.

```bash
iai agents logs <agent> --since 30m | grep '\[retry-fallback\]'
```

What to alert on:

| Signal                                       | Meaning                                                           | Action                                              |
| -------------------------------------------- | ----------------------------------------------------------------- | --------------------------------------------------- |
| Occasional `[retry-fallback] … succeeded`    | Normal — the safety net working                                   | None                                                |
| Sustained escalation rate                    | Evaluation primary struggling with your content                   | Simplify conditions or promote `llms.evaluation`    |
| `[retry-fallback]` at ERROR (both exhausted) | A turn failed an internal decision                                | Investigate the trace; check router health          |
| `[eval] … FAILED`                            | A routine failed boot-time evaluation                             | Fix the routine; the boot log names it              |
| Retrieval warnings                           | Knowledge base unreachable/misbehaving — turns proceed ungrounded | Check KB health; answers degrade silently otherwise |
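
The first four rows of the table can be turned into a rough triage function for an alerting pipeline. This is a sketch under stated assumptions: the returned signal names are hypothetical labels, matching relies only on the `phase` field, the message markers, and the level described above, and retrieval warnings are left out because this guide does not specify their marker.

```python
from typing import Any, Optional


def triage(record: dict[str, Any]) -> Optional[str]:
    """Map one parsed log record to a signal label, or None if it is not alertable."""
    level = str(record.get("level", "")).lower()
    message = str(record.get("message", ""))
    phase = record.get("phase")

    if phase == "retry-fallback" or "[retry-fallback]" in message:
        if level == "error":
            # Both models exhausted: a turn failed an internal decision.
            return "fallback-exhausted"
        # Kick-in or success: harmless alone, worth alerting on a sustained rate.
        return "fallback-escalation"
    if (phase == "eval" or "[eval]" in message) and "FAILED" in message:
        # A routine failed boot-time evaluation.
        return "eval-failed"
    return None
```

Counting `fallback-escalation` results per time window (for example with `collections.Counter`) gives the sustained escalation rate; the threshold for "sustained" is left to the operator.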

## A debugging workflow

"The agent did something odd in ticket 7841":

1. **Find the session** in the trace UI by `TICKET-7841` (you set `session_key`, right?). All its turns are grouped.
2. **Open the odd turn's trace.** Check, in order: which policies matched (and which surprisingly didn't), which routine/node was selected, what each tool returned, what the KB retrieval contributed.
3. **Check the config snapshot** on the trace — it records which policy and content versions were live, so "did yesterday's content release cause this?" is a lookup, not an archaeology dig.
4. **Correlate logs** by the trace id (`trace_id` field) for anything infrastructural (timeouts, reconnects, escalations).
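
Step 4 can be scripted once the logs are exported. The sketch below keeps only the records bound to one trace and orders them by time. It assumes timestamps share the ISO-8601 UTC format of the sample log line, so plain string sorting is chronological; the function name is hypothetical.

```python
import json
from typing import Any, Iterable


def logs_for_trace(lines: Iterable[str], trace_id: str) -> list[dict[str, Any]]:
    """Return JSON log records bound to trace_id, ordered by timestamp."""
    matched = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON lines in the stream
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            matched.append(record)
    # Same-format ISO-8601 UTC timestamps sort correctly as strings.
    return sorted(matched, key=lambda r: r.get("timestamp", ""))
```

Records emitted outside an engine turn carry no `trace_id` and are dropped by this filter, so infrastructure events that are only bound to a `request_id` need a separate pass.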

Symptom-indexed problems live in [Troubleshooting](/agents/operations/troubleshooting.md).

