
Scores

Scores are structured evaluation records that attach quality measurements to your LLM activity at the level of a trace, session, or dataset run. They are the foundation of the evaluation infrastructure; every automated evaluator, human annotation, and custom quality check produces scores that flow into this unified system. Each score can include an optional comment to capture rationale or reviewer notes.

Why Scores Matter

LLM outputs are non-deterministic, and judging their quality is often subjective. Traditional metrics like latency and error rates tell you whether your system is running, but not whether it is producing good results. Scores provide the quality signal needed to:

  • Assess output quality across correctness, relevance, helpfulness, and safety

  • Compare performance between prompt versions, models, or configurations

  • Identify regressions before they impact users

  • Build feedback loops between production data and model improvement

  • Establish baselines for automated evaluation and human review

Overall, scores enable trace segmentation (filtering by quality rating), in-depth analytics (drill-downs by use case and user segment), and trend visualization over time.

For the full Scoring API reference including all method signatures and parameters, see the SDK Documentation.


Score Types

InteractiveAI supports three score types. Most production systems use a mix of all three across different evaluation dimensions.

| Type | Description | Examples |
| --- | --- | --- |
| Numeric | Continuous values within a defined range | 0.87, 4.5, 92 |
| Categorical | Predefined labels from a fixed set | "good", "needs_review", "rejected" |
| Boolean | Pass/fail or true/false flags | True, False |
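
As a rough illustration of how the three types differ in practice, the sketch below infers a score's data type from a plain Python value. The function name `infer_data_type` is hypothetical and not part of the InteractiveAI SDK; the labels it returns follow the Data Type property described on this page.

```python
from typing import Union

ScoreValue = Union[bool, int, float, str]


def infer_data_type(value: ScoreValue) -> str:
    """Map a Python value to NUMERIC, BOOLEAN, or CATEGORICAL (illustrative only)."""
    # bool must be checked first: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, (int, float)):
        return "NUMERIC"
    if isinstance(value, str):
        return "CATEGORICAL"
    raise TypeError(f"unsupported score value: {value!r}")


print(infer_data_type(0.87))            # NUMERIC
print(infer_data_type("needs_review"))  # CATEGORICAL
print(infer_data_type(True))            # BOOLEAN
```

Checking for `bool` before `int` matters here, because `True` would otherwise be classified as a numeric score.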


Score Sources

Scores can be generated through three methods:

| Source | Description |
| --- | --- |
| LLM-as-a-Judge | Automated evaluation using a secondary LLM to grade outputs on criteria like factuality, style compliance, or toxicity |
| Human Annotation | Manual review by your team through the annotation interface, establishing ground-truth benchmarks |
| Custom Evaluation via SDK | Programmatic scoring using custom quality checks, schema validation, or complex LLM workflows |
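
A custom evaluation can be as simple as a deterministic check whose result becomes a score. The sketch below tests whether a model output is a JSON object containing a set of required keys and returns a boolean score as a plain dictionary. The function name, the `required_keys` parameter, the score name, and the dictionary fields are assumptions for illustration; submitting the result to InteractiveAI would go through the SDK's scoring methods, which are not shown here.

```python
import json
from typing import Any, Dict, Iterable

SCORE_NAME = "valid_json_schema"  # hypothetical score name


def score_json_schema(output: str, required_keys: Iterable[str]) -> Dict[str, Any]:
    """Return a boolean score: is `output` a JSON object with all required keys?"""
    required = list(required_keys)
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as exc:
        return {"name": SCORE_NAME, "value": False, "comment": f"invalid JSON: {exc.msg}"}

    if not isinstance(parsed, dict):
        return {"name": SCORE_NAME, "value": False, "comment": "top-level value is not an object"}

    missing = [key for key in required if key not in parsed]
    comment = "all required keys present" if not missing else f"missing keys: {missing}"
    return {"name": SCORE_NAME, "value": not missing, "comment": comment}


result = score_json_schema('{"answer": "42", "sources": []}', ["answer", "sources"])
print(result["value"])  # True
```

The comment field carries the reason for a failure, which mirrors how a score's optional comment is meant to capture rationale.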


Creating Scores

Scoring a Trace

To attach a score to a specific trace, call score_trace() on the span object for direct access, or call create_score() with the trace ID.
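
The SDK calls themselves are not reproduced on this page, so the following is only a sketch of the create_score() path using a small in-memory stand-in for the client. The class `InMemoryScoreClient`, the keyword names (`trace_id`, `name`, `value`, `comment`), and the trace ID are assumptions; check the SDK Documentation for the real signatures.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class ScoreRecord:
    trace_id: str
    name: str
    value: Union[float, str, bool]
    comment: Optional[str] = None


@dataclass
class InMemoryScoreClient:
    """Hypothetical stand-in for an SDK client; it only records scores locally."""
    scores: List[ScoreRecord] = field(default_factory=list)

    def create_score(
        self,
        *,
        trace_id: str,
        name: str,
        value: Union[float, str, bool],
        comment: Optional[str] = None,
    ) -> ScoreRecord:
        if not trace_id:
            raise ValueError("trace_id is required to score a trace")
        record = ScoreRecord(trace_id, name, value, comment)
        self.scores.append(record)
        return record


client = InMemoryScoreClient()
client.create_score(
    trace_id="trace-123",  # assumed ID of an existing trace
    name="correctness",
    value=0.87,
    comment="Matches the reference answer",
)
print(len(client.scores))  # 1
```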

Scoring an Observation

To attach a score to a specific observation within a trace, call score() on the observation object, or call create_score() with both the trace ID and the observation ID.
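
The requirement that an observation-level score carries both IDs can be sketched as a small helper that builds the arguments for such a call. The helper name and the field names (`trace_id`, `observation_id`, `name`, `value`, `comment`) are assumptions rather than the documented signature.

```python
from typing import Any, Dict, Optional, Union


def observation_score_args(
    trace_id: str,
    observation_id: str,
    name: str,
    value: Union[float, str, bool],
    comment: Optional[str] = None,
) -> Dict[str, Any]:
    """Build arguments for a score that targets one observation inside a trace."""
    # An observation only makes sense in the context of its parent trace,
    # so both identifiers must be present.
    if not trace_id or not observation_id:
        raise ValueError("both trace_id and observation_id are required")
    args: Dict[str, Any] = {
        "trace_id": trace_id,
        "observation_id": observation_id,
        "name": name,
        "value": value,
    }
    if comment is not None:
        args["comment"] = comment
    return args


args = observation_score_args("trace-123", "obs-7", "relevance", "good")
print(sorted(args))  # ['name', 'observation_id', 'trace_id', 'value']
```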

Scoring a Session

To score a session, call create_score() with a session_id. There is no score_current_session() method because sessions span multiple traces and are not tied to a single execution context.
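
Because a session spans several traces, one plausible pattern is to compute a session-level value from per-trace results and submit it once with the session ID. The sketch below averages trace scores and builds the arguments for such a call. The helper name, the argument names other than `session_id`, and the averaging rule are assumptions, not documented behavior.

```python
from statistics import mean
from typing import Any, Dict, Sequence


def session_score_args(
    session_id: str, trace_values: Sequence[float], name: str
) -> Dict[str, Any]:
    """Summarize per-trace numeric scores into one session-level score."""
    if not session_id:
        raise ValueError("session_id is required")
    if not trace_values:
        raise ValueError("at least one trace score is needed")
    return {
        "session_id": session_id,
        "name": name,
        "value": round(mean(trace_values), 4),
        "comment": f"mean of {len(trace_values)} trace scores",
    }


args = session_score_args("session-42", [0.5, 0.75, 1.0], "helpfulness")
print(args["value"])  # 0.75
```

The resulting dictionary could then be passed as keyword arguments to create_score(), assuming the SDK accepts these names.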

Adding Metadata to Scores

Use the metadata parameter to attach additional context to a score.
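
Since metadata is described as free-form JSON, it can help to confirm that a metadata object serializes before sending it. The following sketch attaches metadata to a set of score arguments after that check. The helper name, the argument dictionary, and the example metadata keys are hypothetical.

```python
import json
from typing import Any, Dict


def with_metadata(score_args: Dict[str, Any], metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of score_args with a JSON-serializable metadata field attached."""
    try:
        json.dumps(metadata)  # raises if the object cannot be represented as JSON
    except (TypeError, ValueError) as exc:
        raise ValueError(f"metadata must be JSON-serializable: {exc}") from exc
    return {**score_args, "metadata": metadata}


base = {"trace_id": "trace-123", "name": "toxicity", "value": False}
scored = with_metadata(base, {"evaluator_model": "judge-v1", "prompt_version": 3})
print(scored["metadata"]["prompt_version"])  # 3
```

The original dictionary is left unchanged, so the same base arguments can be reused with different metadata.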


Viewing Scores

Scores are accessible from multiple locations throughout the platform depending on your workflow.

Scores Page

The dedicated Scores page under Govern provides a centralized view of all scores across your project. Use this view to filter, search, and analyze scores independently of their parent traces or sessions.

Trace Detail View

When inspecting a specific trace, navigate to the Scores tab in the detail panel to see all scores attached to that trace and its observations. This view displays the score name, value, and any associated comments.

Table Views (Traces, Sessions, Datasets)

In the Traces, Sessions, and Datasets tables, you can add score columns for at-a-glance comparison across multiple items:

  1. Click the columns icon in the top-right corner of the table

  2. In the Columns panel, scroll to the Scores section

  3. Select the scores you want to display as columns

This approach is useful for sorting and filtering large sets of traces or sessions by specific quality metrics.
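
Conceptually, score columns amount to pivoting a flat list of scores into one row per trace. The sketch below reproduces that idea on local data and sorts trace IDs by one score. It does not call the platform; the record fields and function names are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List


def pivot_scores(scores: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Turn flat score records into {trace_id: {score_name: value}}."""
    table: Dict[str, Dict[str, Any]] = defaultdict(dict)
    for score in scores:
        table[score["trace_id"]][score["name"]] = score["value"]
    return dict(table)


def sort_by_score(
    table: Dict[str, Dict[str, Any]], name: str, descending: bool = True
) -> List[str]:
    """Return the trace IDs that have the named score, ordered by its value."""
    rows = [(trace_id, cols[name]) for trace_id, cols in table.items() if name in cols]
    rows.sort(key=lambda row: row[1], reverse=descending)
    return [trace_id for trace_id, _ in rows]


scores = [
    {"trace_id": "t1", "name": "correctness", "value": 0.6},
    {"trace_id": "t1", "name": "toxicity", "value": 0.1},
    {"trace_id": "t2", "name": "correctness", "value": 0.9},
]
table = pivot_scores(scores)
print(sort_by_score(table, "correctness"))  # ['t2', 't1']
```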


Properties of a Score

| Property | Description |
| --- | --- |
| Trace Name | Name of the associated trace |
| Trace | Trace ID of the trace this score belongs to |
| Environment | Deployment context like production, default, or development |
| User | End-user associated with the scored trace |
| Timestamp | Creation time of the score |
| Source | Origin of the score: EVAL (automated evaluation) or ANNOTATION (human annotator) |
| Name | Identifier for the score type (e.g., "correctness", "helpfulness", "toxicity") |
| Data Type | One of NUMERIC, BOOLEAN, or CATEGORICAL |
| Value | The raw score value. Numeric for numeric/boolean scores; string for categorical |
| Metadata | Free-form JSON for extra context |
| Comment | Free-text notes, such as evaluator feedback or reasoning |
| Author | User or system that created the score |
| Eval Configuration ID | References a predefined score configuration that defines the schema, type, range, and categories |
| Trace Tags | Tags inherited from the associated trace |
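
The Value property stores a number for numeric and boolean scores and a string for categorical ones. A minimal sketch of that encoding follows. Mapping True to 1.0 and False to 0.0 is an assumption, since this page does not state the exact numeric representation of booleans, and `encode_value` is a hypothetical helper.

```python
from typing import Union


def encode_value(data_type: str, value: Union[bool, int, float, str]) -> Union[float, str]:
    """Encode a score value as a number (NUMERIC, BOOLEAN) or a string (CATEGORICAL)."""
    if data_type == "NUMERIC":
        # Reject bool explicitly, because bool is a subclass of int in Python.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError("NUMERIC scores need an int or float value")
        return float(value)
    if data_type == "BOOLEAN":
        if not isinstance(value, bool):
            raise TypeError("BOOLEAN scores need a bool value")
        return 1.0 if value else 0.0  # assumed encoding
    if data_type == "CATEGORICAL":
        if not isinstance(value, str):
            raise TypeError("CATEGORICAL scores need a string value")
        return value
    raise ValueError(f"unknown data type: {data_type}")


print(encode_value("NUMERIC", 92))            # 92.0
print(encode_value("BOOLEAN", True))          # 1.0
print(encode_value("CATEGORICAL", "rejected"))  # rejected
```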
