Scores
Scores are structured evaluation records that attach quality measurements to your LLM activity at the level of a trace, observation, session, or dataset run. They are the foundation of the evaluation infrastructure: every automated evaluator, human annotation, and custom quality check produces scores that flow into this unified system. Each score can include an optional comment to capture rationale or reviewer notes.
Why Scores Matter
LLM outputs are non-deterministic, and judging them is often subjective. Traditional metrics like latency and error rates tell you whether your system is running, but not whether it is producing good results. Scores provide the quality signal needed to:
Assess output quality across correctness, relevance, helpfulness, and safety
Compare performance between prompt versions, models, or configurations
Identify regressions before they impact users
Build feedback loops between production data and model improvement
Establish baselines for automated evaluation and human review
Scores also enable trace segmentation (filtering by quality rating), in-depth analytics (drill-downs by use case and user segment), and trend visualization over time.
For the full Scoring API reference including all method signatures and parameters, see the SDK Documentation.
Score Types
InteractiveAI supports three score types. Most production systems use a mix of all three across different evaluation dimensions.
Numeric: continuous values within a defined range, for example 0.87, 4.5, or 92
Categorical: predefined labels from a fixed set, for example "good", "needs_review", or "rejected"
Boolean: pass/fail or true/false flags, for example True or False
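As a minimal sketch of how the three types differ in practice, the helper below maps a Python value to a data type label and optionally checks it against a range or a fixed label set. The function names (infer_data_type, validate_score) and the value_range and categories parameters are hypothetical and written for illustration; they are not part of the InteractiveAI SDK.

```python
from numbers import Real


def infer_data_type(value):
    """Map a Python value to one of the three score data types."""
    # bool is a subclass of int in Python, so it must be checked before Real.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, Real):
        return "NUMERIC"
    if isinstance(value, str):
        return "CATEGORICAL"
    raise TypeError(f"unsupported score value: {value!r}")


def validate_score(value, *, value_range=None, categories=None):
    """Return the data type, raising if the value breaks its constraints."""
    data_type = infer_data_type(value)
    if data_type == "NUMERIC" and value_range is not None:
        low, high = value_range
        if not low <= value <= high:
            raise ValueError(f"{value} is outside the range [{low}, {high}]")
    if data_type == "CATEGORICAL" and categories is not None:
        if value not in categories:
            raise ValueError(f"{value!r} is not one of {sorted(categories)}")
    return data_type


# Hypothetical usage: a 1-5 rating and a fixed label set.
rating_type = validate_score(4.5, value_range=(1, 5))
label_type = validate_score("needs_review", categories={"good", "needs_review", "rejected"})
flag_type = validate_score(True)
```

Checking bool before Real matters because True would otherwise be classified as a numeric 1.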
Score Sources
Scores can be generated through three methods:
LLM-as-a-Judge: automated evaluation using a secondary LLM to grade outputs on criteria like factuality, style compliance, or toxicity
Human Annotation: manual review by your team through the annotation interface, establishing ground-truth benchmarks
Custom Evaluation via SDK: programmatic scoring using custom quality checks, schema validation, or complex LLM workflows
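To make the custom evaluation path concrete, here is a small sketch of a schema validation check that yields a Boolean score with an explanatory comment. The function name check_output_schema, the score name schema_valid, and the REQUIRED_KEYS set are assumptions made for this example; the resulting dictionary would still need to be submitted through the SDK.

```python
import json

# Hypothetical contract: the model must return a JSON object with these keys.
REQUIRED_KEYS = {"answer", "sources"}


def check_output_schema(raw_output):
    """Return a score-shaped dict: Boolean value plus a comment on failure."""
    def result(passed, comment=None):
        return {"name": "schema_valid", "value": passed, "comment": comment}

    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return result(False, f"invalid JSON: {exc.msg}")

    if not isinstance(payload, dict):
        return result(False, "top-level value is not a JSON object")

    missing = sorted(REQUIRED_KEYS - payload.keys())
    if missing:
        return result(False, "missing keys: " + ", ".join(missing))

    return result(True)


good = check_output_schema('{"answer": "42", "sources": ["doc-1"]}')
bad = check_output_schema('{"answer": "42"}')
broken = check_output_schema("not json at all")
```

Putting the failure reason in the comment keeps the score value itself a clean pass/fail flag while preserving the detail a reviewer needs.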
Creating Scores
Scoring a Trace
A trace can be scored in three ways. To score by ID, call create_score() with the trace ID. If you hold the span object, call score_trace() on it for direct access. Inside a decorated function, call score_current_trace() to score the active trace.
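The three call shapes for trace scoring can be sketched with in-memory stand-ins. Only the method names create_score(), score_trace(), and score_current_trace() come from this page. StubClient, StubSpan, the observe decorator, and the parameter names (name, value, trace_id, comment) are assumptions made so the example runs on its own; consult the SDK Documentation for the real signatures.

```python
import contextvars
import functools
import uuid

# Everything below is a stand-in written for illustration, NOT the real SDK.
_current_trace = contextvars.ContextVar("current_trace", default=None)


class StubClient:
    def __init__(self):
        self.scores = []  # recorded score dicts

    def create_score(self, *, name, value, trace_id, comment=None):
        """Attach a score to a trace identified by ID."""
        record = {"name": name, "value": value, "trace_id": trace_id, "comment": comment}
        self.scores.append(record)
        return record

    def score_current_trace(self, *, name, value, comment=None):
        """Score whichever trace is active in the current execution context."""
        trace_id = _current_trace.get()
        if trace_id is None:
            raise RuntimeError("no active trace in this context")
        return self.create_score(name=name, value=value, trace_id=trace_id, comment=comment)

    def observe(self, func):
        """Hypothetical decorator: opens a trace for the duration of the call."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            token = _current_trace.set(uuid.uuid4().hex)
            try:
                return func(*args, **kwargs)
            finally:
                _current_trace.reset(token)
        return wrapper


class StubSpan:
    def __init__(self, client, trace_id):
        self.client = client
        self.trace_id = trace_id

    def score_trace(self, *, name, value, comment=None):
        """Score the trace this span belongs to."""
        return self.client.create_score(
            name=name, value=value, trace_id=self.trace_id, comment=comment
        )


client = StubClient()

# 1. By trace ID.
client.create_score(name="correctness", value=0.87, trace_id="trace-123",
                    comment="Matches the reference answer")

# 2. From a span object.
span = StubSpan(client, trace_id="trace-456")
span.score_trace(name="helpfulness", value=4.5)


# 3. Inside a decorated function, without passing an ID around.
@client.observe
def answer(question):
    reply = f"Echo: {question}"
    client.score_current_trace(name="toxicity", value=0.0)
    return reply


answer("What is a score?")
```

The context-bound variant is convenient inside instrumented code, while the ID-based variant suits scores that arrive later, such as user feedback.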
Scoring an Observation
An observation within a trace can be scored in three ways. To score by ID, call create_score() with both the trace ID and the observation ID. If you hold the observation object, call score() on it. Inside a decorated function, call score_current_span() to score the active observation.
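Observation scoring follows the same pattern, with the observation ID narrowing the target. In this sketch only the names score(), create_score(), and score_current_span() come from this page. StubObservation, observe, the SCORES list, and all parameter names are hypothetical stand-ins so the example is self-contained.

```python
import contextvars
import functools
import itertools

# Illustrative stand-ins only; the real SDK objects are not reproduced here.
_active = contextvars.ContextVar("active_observation", default=None)
_ids = itertools.count(1)
SCORES = []  # in-memory sink standing in for the platform


def create_score(*, name, value, trace_id, observation_id=None, comment=None):
    """Record a score; observation_id narrows it from the trace to one observation."""
    record = {"name": name, "value": value, "trace_id": trace_id,
              "observation_id": observation_id, "comment": comment}
    SCORES.append(record)
    return record


class StubObservation:
    def __init__(self, trace_id, observation_id):
        self.trace_id = trace_id
        self.id = observation_id

    def score(self, *, name, value, comment=None):
        """Score this observation using its own trace and observation IDs."""
        return create_score(name=name, value=value, trace_id=self.trace_id,
                            observation_id=self.id, comment=comment)


def score_current_span(*, name, value, comment=None):
    """Score the observation that is active in the current context."""
    observation = _active.get()
    if observation is None:
        raise RuntimeError("no active observation in this context")
    return observation.score(name=name, value=value, comment=comment)


def observe(func):
    """Hypothetical decorator: makes a new observation active during the call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        n = next(_ids)
        token = _active.set(StubObservation(f"trace-{n}", f"obs-{n}"))
        try:
            return func(*args, **kwargs)
        finally:
            _active.reset(token)
    return wrapper


# 1. By trace ID and observation ID.
create_score(name="relevance", value=0.92, trace_id="trace-a", observation_id="obs-retrieval")

# 2. From an observation object.
StubObservation("trace-b", "obs-generation").score(name="schema_valid", value=True)


# 3. Inside a decorated function.
@observe
def retrieve(query):
    docs = ["doc-1", "doc-2"]
    score_current_span(name="retrieved_any", value=bool(docs))
    return docs


retrieve("refund policy")
```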
Scoring a Session
To score a session, call create_score() with a session_id. There is no score_current_session() method because sessions span multiple traces and are not tied to a single execution context.
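A short sketch of the difference between trace-level and session-level scores, using an in-memory stand-in for create_score(). The session_id parameter is named on this page; the other parameter names, the SCORES list, and the rule that at least one target ID is required are assumptions made for illustration.

```python
SCORES = []  # in-memory sink standing in for the platform


def create_score(*, name, value, session_id=None, trace_id=None, comment=None):
    """Stand-in for create_score(); assumes a score needs at least one target."""
    if session_id is None and trace_id is None:
        raise ValueError("a score needs a target: pass session_id or trace_id")
    record = {"name": name, "value": value, "session_id": session_id,
              "trace_id": trace_id, "comment": comment}
    SCORES.append(record)
    return record


session_id = "session-42"
turn_ratings = {"trace-1": 3.0, "trace-2": 4.0, "trace-3": 5.0}

# Each turn of the conversation is its own trace and gets its own score.
for trace_id, rating in turn_ratings.items():
    create_score(name="helpfulness", value=rating, trace_id=trace_id)

# The outcome of the whole conversation belongs to the session, not to any
# single trace, so it is attached by session_id.
create_score(name="conversation_outcome", value="resolved", session_id=session_id,
             comment="User confirmed the fix on the third turn")

session_scores = [s for s in SCORES if s["session_id"] == session_id]
```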
Adding Metadata to Scores
Pass the metadata parameter to attach additional context to a score. Metadata is free-form JSON, so it can hold any structured detail that helps interpret the score later.
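The sketch below shows one way metadata might accompany a score, with a check that the payload can be serialized as JSON. The metadata parameter is named on this page; the stand-in create_score() function, the other parameter names, and the metadata keys (judge_model, prompt_version, rubric) are hypothetical.

```python
import json

SCORES = []  # in-memory sink standing in for the platform


def create_score(*, name, value, trace_id, comment=None, metadata=None):
    """Stand-in for create_score() that rejects metadata JSON cannot represent."""
    metadata = dict(metadata or {})
    try:
        json.dumps(metadata)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"metadata must be JSON-serializable: {exc}") from exc
    record = {"name": name, "value": value, "trace_id": trace_id,
              "comment": comment, "metadata": metadata}
    SCORES.append(record)
    return record


record = create_score(
    name="factuality",
    value=0.78,
    trace_id="trace-123",
    comment="One unsupported claim in the second paragraph",
    metadata={
        "judge_model": "example-judge-v1",   # hypothetical evaluator name
        "prompt_version": 3,
        "rubric": ["grounded", "complete"],
    },
)
```

Keeping the human-readable rationale in the comment and machine-readable details in metadata makes both easier to filter on later.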
Viewing Scores
Scores are accessible from multiple locations throughout the platform depending on your workflow.
Scores Page
The dedicated Scores page under Govern provides a centralized view of all scores across your project. Use this view to filter, search, and analyze scores independently of their parent traces or sessions.

Trace Detail View
When inspecting a specific trace, navigate to the Scores tab in the detail panel to see all scores attached to that trace and its observations. This view displays the score name, value, and any associated comments.

Table Views (Traces, Sessions, Datasets)
For Traces, Sessions, and Datasets tables, you can add score columns directly to the table for at-a-glance comparison across multiple items:
Click the columns icon in the top-right corner of the table
In the Columns panel, scroll to the Scores section
Select the scores you want to display as columns

This approach is useful for sorting and filtering large sets of traces or sessions by specific quality metrics.
Properties of a Score
| Property | Description |
| --- | --- |
| Trace Name | Name of the associated trace |
| Trace | Trace ID of the trace this score belongs to |
| Environment | Deployment context like production, default, or development |
| User | End-user associated with the scored trace |
| Timestamp | Creation time of the score |
| Source | Origin of the score: EVAL (automated evaluation) or ANNOTATION (human annotator) |
| Name | Identifier for the score type (e.g., "correctness", "helpfulness", "toxicity") |
| Data Type | One of NUMERIC, BOOLEAN, or CATEGORICAL |
| Value | The raw score value. Numeric for numeric/boolean scores; string for categorical |
| Metadata | Free-form JSON for extra context |
| Comment | Free-text notes, such as evaluator feedback or reasoning |
| Author | User or system that created the score |
| Eval Configuration ID | References a predefined score configuration that defines the schema, type, range, and categories |
| Trace Tags | Tags inherited from the associated trace |
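For working with exported scores in your own code, the properties above could be modeled roughly as follows. This is a sketch under assumptions: the ScoreRecord class, its snake_case field names, and the mean_by_source helper are hypothetical and cover only a subset of the properties; they do not describe the platform's actual export format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from statistics import mean
from typing import Optional, Union


@dataclass
class ScoreRecord:
    name: str                     # e.g. "correctness"
    data_type: str                # "NUMERIC", "BOOLEAN", or "CATEGORICAL"
    value: Union[float, str]      # numeric for numeric/boolean, string for categorical
    source: str                   # "EVAL" or "ANNOTATION"
    trace_id: Optional[str] = None
    environment: str = "default"
    comment: Optional[str] = None
    metadata: dict = field(default_factory=dict)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def mean_by_source(records, name):
    """Average one numeric score per source, e.g. to compare judges with annotators."""
    buckets = {}
    for record in records:
        if record.name == name and record.data_type == "NUMERIC":
            buckets.setdefault(record.source, []).append(float(record.value))
    return {source: mean(values) for source, values in buckets.items()}


records = [
    ScoreRecord("correctness", "NUMERIC", 0.5, "EVAL", trace_id="trace-1"),
    ScoreRecord("correctness", "NUMERIC", 1.0, "EVAL", trace_id="trace-2"),
    ScoreRecord("correctness", "NUMERIC", 0.9, "ANNOTATION", trace_id="trace-1"),
    ScoreRecord("review_status", "CATEGORICAL", "good", "ANNOTATION", trace_id="trace-1"),
]

averages = mean_by_source(records, "correctness")
```

Comparing automated and human averages for the same score name is one simple way to check whether an automated evaluator tracks reviewer judgment.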