Scores

Scores are structured evaluation records that attach quality measurements to your LLM activity at the level of a trace, session, or dataset run. They are the foundation of the evaluation infrastructure; every automated evaluator, human annotation, and custom quality check produces scores that flow into this unified system. Each score can include an optional comment to capture rationale or reviewer notes.

Why Scores Matter

LLM outputs are non-deterministic and subjective. Traditional metrics like latency and error rates tell you whether your system is running, but not whether it is producing good results. Scores provide the quality signal needed to:

  • Assess output quality across correctness, relevance, helpfulness, and safety

  • Compare performance between prompt versions, models, or configurations

  • Identify regressions before they impact users

  • Build feedback loops between production data and model improvement

  • Establish baselines for automated evaluation and human review

Overall, scores enable trace segmentation (filtering by quality rating), in-depth analytics (drill-downs by use case and user segment), and trend visualization over time.


Score Types

InteractiveAI supports three score types. Most production systems use a mix of all three across different evaluation dimensions.

  • Numeric: continuous values within a defined range (e.g., 0.87, 4.5, 92)

  • Categorical: predefined labels from a fixed set (e.g., "good", "needs_review", "rejected")

  • Boolean: pass/fail or true/false flags (True, False)


Viewing Scores

Scores are accessible from multiple locations throughout the platform depending on your workflow.

Scores Page

The dedicated Scores page under Observability provides a centralized view of all scores across your project. Use this view to filter, search, and analyze scores independently of their parent traces or sessions.

Trace Detail View

When inspecting a specific trace, navigate to the Scores tab in the detail panel to see all scores attached to that trace and its observations. This view displays the score name, value, and any associated comments.

Table Views (Traces, Sessions, Datasets)

In the Traces, Sessions, and Datasets tables, you can add score columns for at-a-glance comparison across multiple items:

  1. Click the columns icon in the top-right corner of the table

  2. In the Columns panel, scroll to the Scores section

  3. Select the scores you want to display as columns

This approach is useful for sorting and filtering large sets of traces or sessions by specific quality metrics.


Score Sources

Scores can be generated through three methods:

  • LLM-as-a-Judge: automated evaluation using a secondary LLM to grade outputs on criteria such as factuality, style compliance, or toxicity

  • Human Annotation: manual review by your team through the annotation interface, establishing ground-truth benchmarks

  • Custom Evaluation via SDK: programmatic scoring using custom quality checks, schema validation, or complex LLM workflows

Properties of a Score

  • Trace Name: name of the associated trace

  • Trace: ID of the trace this score belongs to

  • Environment: deployment context, such as production, staging, or development

  • User: end-user associated with the scored trace

  • Timestamp: creation time of the score

  • Source: origin of the score, either EVAL (automated evaluation) or ANNOTATION (human annotator)

  • Name: identifier for the score type (e.g., "correctness", "helpfulness", "toxicity")

  • Data Type: one of NUMERIC, BOOLEAN, or CATEGORICAL

  • Value: the raw score value; numeric for numeric and boolean scores, string for categorical scores

  • Metadata: free-form JSON for extra context

  • Comment: free-text notes, such as evaluator feedback or reasoning

  • Author: user or system that created the score

  • Eval Configuration ID: reference to a predefined score configuration that defines the schema, type, range, and categories

  • Trace Tags: tags inherited from the associated trace
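As a concrete reading of the Value property, the hedged sketch below validates a raw value against its data type. The function and its rules are illustrative, not SDK API; it assumes boolean scores are stored numerically as 0 or 1, per the Value description above.

```python
# Illustrative check: which raw Python values fit each score data type.
# Not part of the SDK; assumes the Value property's representation rules.

def validate_score_value(data_type, value):
    if data_type == "NUMERIC":
        # Numeric scores are ints or floats (bool is excluded, since
        # isinstance(True, int) is True in Python).
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if data_type == "BOOLEAN":
        # Assumption: boolean scores are stored numerically as 0 or 1.
        return value in (0, 1)
    if data_type == "CATEGORICAL":
        # Categorical scores are string labels.
        return isinstance(value, str)
    return False

validate_score_value("NUMERIC", 0.87)        # → True
validate_score_value("CATEGORICAL", "good")  # → True
validate_score_value("CATEGORICAL", 3)       # → False
```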


Creating Scores

Score a Specific Trace

Use create_score() to attach a score to a trace by its ID:
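A minimal sketch of the call shape: create_score() is named in this section, but the client setup and keyword names (trace_id, data_type, comment) are assumptions drawn from the score properties listed above. A stub stands in for the real SDK client so the example is self-contained; check the SDK reference for exact signatures.

```python
# Sketch only: StubClient stands in for the real InteractiveAI SDK client.
# Keyword names mirror the score properties above but may differ in the SDK.

class StubClient:
    """Stand-in client; create_score() just records the payload it received."""
    def __init__(self):
        self.scores = []

    def create_score(self, *, trace_id, name, value, data_type, comment=None):
        self.scores.append({
            "trace_id": trace_id,
            "name": name,
            "value": value,
            "data_type": data_type,
            "comment": comment,
        })

client = StubClient()  # in real code: your SDK client instance
client.create_score(
    trace_id="trace-abc-123",  # ID of the trace to score
    name="correctness",        # score identifier
    value=0.87,                # numeric value
    data_type="NUMERIC",
    comment="Answer matched the reference output.",
)
```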

Score the Current Trace

When working within an active trace context, use score_current_trace() for convenience:

Score the Current Observation

Score the specific observation you're currently in:

Adding Metadata to Scores

Include additional context with your scores using the metadata parameter:
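A sketch of the shape such a call might take: the metadata parameter is named in this section, but the other keywords and the stub below are assumptions standing in for the real create_score() call, and the metadata keys shown are arbitrary examples of free-form context.

```python
# Sketch only: a stub standing in for create_score(), showing how a
# free-form metadata dict rides along with the score payload.

def create_score(*, trace_id, name, value, data_type, metadata=None):
    """Stand-in that returns the payload the SDK call would send."""
    return {"trace_id": trace_id, "name": name, "value": value,
            "data_type": data_type, "metadata": metadata or {}}

payload = create_score(
    trace_id="trace-abc-123",
    name="toxicity",
    value=0.02,
    data_type="NUMERIC",
    metadata={                        # free-form JSON for extra context
        "model": "gpt-4o",            # which model produced the output
        "eval_prompt_version": "v3",  # which judge prompt graded it
    },
)
```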
