Scores
Scores are structured evaluation records that attach quality measurements to your LLM activity at the level of a trace, session, or dataset run. They are the foundation of the evaluation infrastructure; every automated evaluator, human annotation, and custom quality check produces scores that flow into this unified system. Each score can include an optional comment to capture rationale or reviewer notes.
Why Scores Matter
LLM outputs are non-deterministic, and their quality is often subjective. Traditional metrics like latency and error rates tell you whether your system is running, but not whether it is producing good results. Scores provide the quality signal needed to:
Assess output quality across correctness, relevance, helpfulness, and safety
Compare performance between prompt versions, models, or configurations
Identify regressions before they impact users
Build feedback loops between production data and model improvement
Establish baselines for automated evaluation and human review
Overall, scores enable trace segmentation (filtering by quality rating), in-depth analytics (drill-downs by use case and user segment), and trend visualization over time.
Score Types
InteractiveAI supports three score types. Most production systems use a mix of all three across different evaluation dimensions.
Numeric: continuous values within a defined range (e.g., 0.87, 4.5, 92)
Categorical: predefined labels from a fixed set (e.g., "good", "needs_review", "rejected")
Boolean: pass/fail or true/false flags (True, False)
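One way to see how the three data types map onto Python values is a small validation helper. This is an illustrative sketch, not part of the InteractiveAI SDK; the type names mirror the NUMERIC, BOOLEAN, and CATEGORICAL data types described on this page.

```python
# Illustrative only: maps the three score data types to Python value checks.
def validate_score_value(data_type: str, value) -> bool:
    if data_type == "NUMERIC":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if data_type == "BOOLEAN":
        return isinstance(value, bool)
    if data_type == "CATEGORICAL":
        return isinstance(value, str)
    return False

print(validate_score_value("NUMERIC", 0.87))        # True
print(validate_score_value("CATEGORICAL", "good"))  # True
print(validate_score_value("BOOLEAN", "yes"))       # False
```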
Viewing Scores
Scores are accessible from multiple locations throughout the platform depending on your workflow.
Scores Page
The dedicated Scores page under Observability provides a centralized view of all scores across your project. Use this view to filter, search, and analyze scores independently of their parent traces or sessions.

Trace Detail View
When inspecting a specific trace, navigate to the Scores tab in the detail panel to see all scores attached to that trace and its observations. This view displays the score name, value, and any associated comments.

Table Views (Traces, Sessions, Datasets)
In the Traces, Sessions, and Datasets tables, you can add score columns directly to the table for at-a-glance comparison across multiple items:
Click the columns icon in the top-right corner of the table
In the Columns panel, scroll to the Scores section
Select the scores you want to display as columns

This approach is useful for sorting and filtering large sets of traces or sessions by specific quality metrics.
Score Sources
Scores can be generated through three methods:
LLM-as-a-Judge
Automated evaluation using a secondary LLM to grade outputs on criteria like factuality, style compliance, or toxicity
Human Annotation
Manual review by your team through the annotation interface, establishing ground-truth benchmarks
Custom Evaluation via SDK
Programmatic scoring using custom quality checks, schema validation, or complex LLM workflows
Properties of a Score
Trace Name: name of the associated trace
Trace: trace ID of the trace this score belongs to
Environment: deployment context such as production, staging, or development
User: end-user associated with the scored trace
Timestamp: creation time of the score
Source: origin of the score, either EVAL (automated evaluation) or ANNOTATION (human annotator)
Name: identifier for the score type (e.g., "correctness", "helpfulness", "toxicity")
Data Type: one of NUMERIC, BOOLEAN, or CATEGORICAL
Value: the raw score value; numeric for numeric/boolean scores, string for categorical
Metadata: free-form JSON for extra context
Comment: free-text notes, such as evaluator feedback or reasoning
Author: user or system that created the score
Eval Configuration ID: references a predefined score configuration that defines the schema, type, range, and categories
Trace Tags: tags inherited from the associated trace
Creating Scores
Use create_score() to:
Attach a score to a specific trace by ID
Attach a score to a specific observation within a trace
Attach a score to an entire session
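The three attachment targets can be sketched as follows. The parameter names (`trace_id`, `observation_id`, `session_id`, `data_type`) are assumptions inferred from the score properties listed on this page, not a verified SDK signature, and the stand-in client below only records calls so the shapes are visible; consult the SDK reference for the real API.

```python
# Minimal stand-in for the real client; it records each create_score() call.
class ScoreClient:
    def __init__(self):
        self.scores = []

    def create_score(self, **score):
        self.scores.append(score)
        return score

client = ScoreClient()

# Attach a score to a specific trace by ID
client.create_score(
    trace_id="trace_abc123",
    name="correctness",
    value=0.87,
    data_type="NUMERIC",
    comment="Answer matched the reference within tolerance.",
)

# Attach a score to a specific observation within that trace
client.create_score(
    trace_id="trace_abc123",
    observation_id="obs_def456",
    name="toxicity",
    value=False,
    data_type="BOOLEAN",
)

# Attach a score to an entire session
client.create_score(
    session_id="session_xyz789",
    name="resolution",
    value="resolved",
    data_type="CATEGORICAL",
)

print(len(client.scores))  # 3
```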
Score the Current Trace
When working within an active trace context, use score_current_trace() for convenience:
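A sketch of the idea behind score_current_trace(): inside an active trace context the SDK already knows the current trace ID, so you pass only the score fields. The contextvars plumbing below is a stand-in for the real SDK's context tracking, not its actual implementation.

```python
import contextvars

# Stand-in for the SDK's internal trace-context tracking.
_current_trace = contextvars.ContextVar("current_trace", default=None)
scores = []

def score_current_trace(name, value, **extra):
    trace_id = _current_trace.get()
    if trace_id is None:
        raise RuntimeError("score_current_trace() called outside a trace context")
    scores.append({"trace_id": trace_id, "name": name, "value": value, **extra})

# Simulate entering a trace context, then score the current trace
token = _current_trace.set("trace_abc123")
score_current_trace("helpfulness", 4.5, comment="Clear, actionable answer")
_current_trace.reset(token)

print(scores[0]["trace_id"])  # trace_abc123
```

The convenience here is that callers never pass a trace ID; the ambient context supplies it, which is why the function raises if no trace is active.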
Score the Current Observation
Score the specific observation you're currently in:
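The page names this capability but not the function, so `score_current_observation()` below is an assumed name; the stub only records the score fields that the real SDK would attach to the ambient observation.

```python
recorded = []

def score_current_observation(**score):
    # A real SDK would resolve the ambient trace and observation IDs here;
    # this stub just records the score fields for inspection.
    recorded.append(score)

score_current_observation(
    name="relevance",
    value="good",
    data_type="CATEGORICAL",
    comment="Retrieved passage directly answers the question.",
)

print(recorded[0]["name"])  # relevance
```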
Adding Metadata to Scores
Include additional context with your scores using the metadata parameter:
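A sketch of attaching free-form metadata to a score. The metadata keys shown (evaluator_version, rubric, reference_id) are illustrative, and create_score() here is a stand-in for the real SDK call.

```python
def create_score(**score):
    # A real client would send this payload to the API; here we just return it.
    return score

score = create_score(
    trace_id="trace_abc123",
    name="correctness",
    value=0.92,
    data_type="NUMERIC",
    metadata={
        "evaluator_version": "v2",
        "rubric": "factual-qa",
        "reference_id": "ref_001",
    },
)

print(score["metadata"]["rubric"])  # factual-qa
```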