Monitor
Once your services are deployed, the Monitoring page provides real-time infrastructure visibility into every running workload. While the Reporting dashboards focus on LLM-level metrics like traces, latency, and token costs, Monitor operates at the infrastructure layer: how much traffic your containers are handling, whether requests are succeeding, how fast responses are being served, and whether your pods have enough CPU and memory to operate reliably.
This distinction matters. A service can return correct LLM outputs while silently running out of memory, or it can show healthy pod metrics while returning degraded responses due to a bad prompt version. Monitor and Reporting are complementary views: one watches the infrastructure, the other watches the AI behavior. Together, they give you complete operational coverage.

Dashboard Controls
The top of the page provides four controls for customizing your view:
Agent selector: A dropdown in the breadcrumb bar that filters the monitoring view by a specific agent. It lists all agents in the project with a search field, and includes an "All agents" option to show everything and a "+ New agent" shortcut. Use this to isolate a single agent when debugging or to compare traffic patterns across agents.
Workload filter: The "All workloads" dropdown lets you select which deployed services to display. Use this to isolate a specific workload when debugging or to compare multiple services side by side.
Time range: Choose the observation window: 5m, 15m, 1h, 6h (default), 24h, or 7d. Shorter windows are useful for debugging active incidents, while longer windows help identify trends and patterns.
Auto-refresh: Toggle automatic dashboard refresh on or off. When enabled, the charts update periodically without requiring manual page reloads.
Traffic Metrics
Total Requests per Domain
Displays the request volume over time for each deployed domain. Each workload's domain appears as a separate series in the chart legend. Hover over any data point to see the exact request count at that timestamp.
Use this chart to understand traffic patterns, identify unexpected spikes, and verify that your service is receiving requests after deployment.

Total Requests per Status Code
Breaks down request volume by HTTP status code category: 2xx (success, green), 4xx (client errors, yellow), and 5xx (server errors, red). Hover over any bar to see the exact count for each category at that timestamp.
A healthy service shows predominantly green bars. A sudden increase in red (5xx) bars indicates server-side failures that need investigation. Yellow (4xx) bars may indicate client-side issues like malformed requests or authentication problems.
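The status code grouping can be reproduced offline if you have raw status codes from your own access logs. The following is a minimal sketch, not the dashboard's implementation: the function names and the sample data are hypothetical, and only the grouping by hundreds digit comes from the chart description above.

```python
from collections import Counter


def status_class(code: int) -> str:
    """Map an HTTP status code to its class label, e.g. 503 -> '5xx'."""
    return f"{code // 100}xx"


def summarize(codes):
    """Count requests per status class and compute the share of 5xx responses."""
    counts = Counter(status_class(c) for c in codes)
    total = sum(counts.values())
    # Counter returns 0 for a missing key, so a sample with no 5xx is safe.
    error_rate = counts["5xx"] / total if total else 0.0
    return dict(counts), error_rate


# Hypothetical status codes observed in one chart interval.
sample = [200, 200, 201, 404, 500, 200, 503, 200, 401, 200]
counts, error_rate = summarize(sample)
print(counts, f"5xx rate: {error_rate:.0%}")
# prints: {'2xx': 6, '4xx': 2, '5xx': 2} 5xx rate: 20%
```

Tracking the 5xx share rather than the raw count makes it easier to tell a real failure from an ordinary traffic increase.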

Performance Metrics
Response Time
Tracks response latency over time using three percentile lines: p50 (median, blue), p95 (teal), and p99 (orange). Hover over the chart to see exact values at any point in time.
The gap between p50 and p99 reveals tail latency. If p50 is 9ms but p99 is 2.5s, most requests are fast but a small percentage experience significant delays. This pattern often points to cold starts, resource contention, or specific request types that trigger slower code paths.
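To see how a small number of slow requests produces this pattern, the sketch below computes percentiles with the nearest-rank method. The method and the sample latencies are assumptions chosen for illustration; the dashboard may use a different percentile estimator.

```python
def percentile(samples, p: int):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest sample such that at least p% of all samples
    are less than or equal to it.
    """
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in ms: 98 fast requests and 2 slow ones.
latencies_ms = [9] * 98 + [2500] * 2
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
print(f"p50={p50}ms p95={p95}ms p99={p99}ms")
# prints: p50=9ms p95=9ms p99=2500ms
```

Only 2% of the requests in this sample are slow, so p50 and p95 stay at 9 ms while p99 jumps to 2.5 s. An average would hide the same outliers that the percentile lines expose.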

Resource Metrics
CPU Usage
Displays CPU consumption in cores over time for each workload. Hover over the chart to see the exact core usage at any point.
CPU Resources
A summary table showing current CPU allocation and utilization per workload:
Name: Workload name
Replicas: Number of running replicas
Allocated: Total CPU cores allocated (per replica)
Usage: Current CPU consumption in cores
Usage %: Current usage as a percentage of allocated CPU
Max Usage %: Peak usage percentage observed in the selected time range

If Usage % consistently approaches 100%, the workload likely needs a larger CPU allocation or additional replicas. If it stays very low, the workload may be over-provisioned.
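The Usage % column is the ratio of usage to allocation. The sketch below shows that arithmetic together with a simple sizing check. It assumes usage and allocation are measured over the same scope (both per replica, or both summed across replicas), and the 90% and 10% thresholds are hypothetical values chosen for illustration, not limits defined by the product.

```python
def usage_percent(usage_cores: float, allocated_cores: float) -> float:
    """Return CPU usage as a percentage of the allocated cores."""
    if allocated_cores <= 0:
        raise ValueError("allocated_cores must be positive")
    return usage_cores / allocated_cores * 100


def sizing_hint(usage_pct: float, high: float = 90.0, low: float = 10.0) -> str:
    """Classify a usage percentage against hypothetical thresholds."""
    if usage_pct >= high:
        return "under-provisioned"
    if usage_pct <= low:
        return "over-provisioned"
    return "ok"


# Hypothetical workload: 0.25 cores used out of 0.5 cores allocated.
pct = usage_percent(0.25, 0.5)
print(f"{pct:.0f}% -> {sizing_hint(pct)}")
# prints: 50% -> ok
```

Applying the same check to Max Usage % rather than the current value helps catch short peaks that a single snapshot would miss.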
Memory Usage
Displays memory consumption over time in MB for each workload. Hover over the chart to see the exact memory usage at any point.
Memory Resources
A summary table showing current memory allocation and utilization per workload:
Name: Workload name
Replicas: Number of running replicas
Allocated: Total memory allocated (per replica)
Usage: Current memory consumption
Usage %: Current usage as a percentage of allocated memory
Max Usage %: Peak usage percentage observed in the selected time range

Memory that grows steadily over time without being released may indicate a memory leak. If Usage % approaches 100%, the container risks being killed by the orchestrator (OOMKilled), which causes service interruptions.
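One rough way to quantify steady growth is to fit a straight line through evenly spaced memory samples and look at its slope. The sketch below uses an ordinary least-squares fit. The sample values, the 512 MB limit, and the function names are hypothetical, and a positive slope is only a hint: caches and warm-up can also cause growth that later levels off.

```python
def slope_per_sample(values_mb):
    """Least-squares slope of evenly spaced samples, in MB per sample."""
    n = len(values_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values_mb) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values_mb))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def samples_until_limit(values_mb, limit_mb):
    """Estimate how many more samples fit before the limit, or None if not growing."""
    slope = slope_per_sample(values_mb)
    if slope <= 0:
        return None
    return (limit_mb - values_mb[-1]) / slope


# Hypothetical memory readings in MB, taken at a fixed interval.
readings = [100, 110, 120, 130]
slope = slope_per_sample(readings)
remaining = samples_until_limit(readings, limit_mb=512)
print(f"{slope:.1f} MB per sample, about {remaining:.1f} samples until the limit")
```

Longer time ranges such as 24h or 7d give a more reliable trend than a short window, where normal fluctuation can look like growth.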

