# Scores

Scores are **structured evaluation** records that attach quality measurements to your LLM activity at the level of a **trace**, **session**, or **dataset run**. They are the foundation of the evaluation infrastructure; every automated evaluator, human annotation, and custom quality check produces scores that flow into this unified system. Each score can include an optional comment to capture rationale or reviewer notes.

### Why Scores Matter

LLM outputs are non-deterministic and subjective. Traditional metrics like latency and error rates tell you if your system is running, but not if it is producing good results. Scores provide the quality signal needed to:

* Assess **output quality** across correctness, relevance, helpfulness, and safety
* **Compare performance** between prompt versions, models, or configurations
* **Identify regressions** before they impact users
* Build **feedback loops** between production data and model improvement
* Establish **baselines** for automated evaluation and human review

Overall, scores enable trace segmentation (filtering by quality rating), in-depth analytics (drill-downs by use case and user segment), and trend visualization over time.

{% hint style="info" %}
For the full Scoring API reference including all method signatures and parameters, see the [SDK Documentation](https://app.gitbook.com/s/jHEEbkpMbUW2x51XS8Ez/scoring).
{% endhint %}

***

### Score Types

InteractiveAI supports **three score types**. Most production systems use a mix of all three across different evaluation dimensions.

| Type            | Description                              | Examples                                 |
| --------------- | ---------------------------------------- | ---------------------------------------- |
| **Numeric**     | Continuous values within a defined range | `0.87`, `4.5`, `92`                      |
| **Categorical** | Predefined labels from a fixed set       | `"good"`, `"needs_review"`, `"rejected"` |
| **Boolean**     | Pass/fail or true/false flags            | `True`, `False`                          |

***

### Score Sources

Scores can come from **three sources**:

| Source                        | Description                                                                                                            |
| ----------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| **LLM-as-a-Judge**            | Automated evaluation using a secondary LLM to grade outputs on criteria like factuality, style compliance, or toxicity |
| **Human Annotation**          | Manual review by your team through the annotation interface, establishing ground-truth benchmarks                      |
| **Custom Evaluation via SDK** | Programmatic scoring using custom quality checks, schema validation, or complex LLM workflows                          |

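For illustration, an **LLM-as-a-Judge** evaluation typically runs a grading prompt against the original output and writes the result back with `create_score()` (covered in the next section). The sketch below is a minimal, hedged example: it assumes an OpenAI-style chat client, and the `judge_factuality` helper, judge model, and grading prompt are illustrative rather than part of the InteractiveAI SDK.

```python
# Illustrative LLM-as-a-Judge sketch (the helper, judge model, and prompt
# are hypothetical; only create_score()/flush() come from the SDK).
from openai import OpenAI  # assumes an OpenAI-style client is available

judge = OpenAI()

def judge_factuality(question: str, answer: str) -> float:
    """Ask a secondary LLM to grade factual accuracy on a 0-1 scale."""
    grading = judge.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        messages=[{
            "role": "user",
            "content": (
                "Rate the factual accuracy of this answer from 0 to 1. "
                "Reply with only the number.\n"
                f"Question: {question}\nAnswer: {answer}"
            ),
        }],
    )
    return float(grading.choices[0].message.content.strip())

value = judge_factuality("What's the capital of Spain?", "Madrid is the capital of Spain.")

interactiveai.create_score(
    trace_id="your-trace-id",
    name="factuality",
    value=value,
    data_type="NUMERIC",
    comment="Scored by LLM-as-a-Judge"
)
interactiveai.flush()
```
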
***

### Creating Scores

#### Scoring a Trace

Attach a score to a specific trace by ID:

{% tabs %}
{% tab title="Context Manager" %}
Use `score_trace()` on the span object for direct access, or `create_score()` with a trace ID:

```python
# Option 1: Using the span object
with interactiveai.start_as_current_observation(as_type="span", name="pipeline") as span:
    result = process_request(user_input)
    span.update(output={"result": result})

    span.score_trace(
        name="overall_quality",
        value=0.92,
        data_type="NUMERIC",
        comment="Response directly addressed the user's question"
    )

interactiveai.flush() # Ensures all pending data is sent
```

```python
# Option 2: Using create_score() with a trace ID (e.g., from outside the trace)

# Numeric score
interactiveai.create_score(
    trace_id="your-trace-id",
    name="relevance",
    value=0.92,
    data_type="NUMERIC",
    comment="Response directly addressed the user's question"
)

# Boolean score
interactiveai.create_score(
    trace_id="your-trace-id",
    name="factually_correct",
    value=True,
    data_type="BOOLEAN"
)

# Categorical score
interactiveai.create_score(
    trace_id="your-trace-id",
    name="quality",
    value="good",
    data_type="CATEGORICAL",
    comment="Clear and helpful response"
)

interactiveai.flush() # Ensures all pending data is sent
```

{% endtab %}

{% tab title="@observe Decorator" %}
Inside a decorated function, use `score_current_trace()`:

```python
import interactiveai  # module-level client used for the scoring calls below
from interactiveai import observe

@observe()
def handle_request(user_input):
    result = process_request(user_input)

    # Numeric score
    interactiveai.score_current_trace(
        name="relevance",
        value=0.92,
        data_type="NUMERIC",
        comment="Response directly addressed the user's question"
    )

    # Boolean score
    interactiveai.score_current_trace(
        name="factually_correct",
        value=True,
        data_type="BOOLEAN"
    )

    # Categorical score
    interactiveai.score_current_trace(
        name="quality",
        value="good",
        data_type="CATEGORICAL",
        comment="Clear and helpful response"
    )

    return result

handle_request("What's the weather in Madrid?")
interactiveai.flush()
```

{% endtab %}
{% endtabs %}

#### Scoring an Observation

Attach a score to a specific observation within a trace:

{% tabs %}
{% tab title="Context Manager" %}
Use `score()` on the observation object, or `create_score()` with both trace and observation IDs:

```python
# Option 1: Using the observation object
with interactiveai.start_as_current_observation(as_type="generation", name="llm-call") as generation:
    generation.update(
        input={"prompt": "Summarize this document"},
        output={"response": "The document discusses..."}
    )

    generation.score(
        name="generation_quality",
        value=4.5,
        data_type="NUMERIC",
        comment="Well-structured output with minor formatting issues"
    )

interactiveai.flush() # Ensures all pending data is sent
```

```python
# Option 2: Using create_score() with trace and observation IDs
interactiveai.create_score(
    trace_id="your-trace-id",
    observation_id="your-observation-id",
    name="generation_quality",
    value=4.5,
    data_type="NUMERIC",
    comment="Well-structured output with minor formatting issues"
)

interactiveai.flush() # Ensures all pending data is sent
```

{% endtab %}

{% tab title="@observe Decorator" %}
Inside a decorated function, use `score_current_span()`:

```python
import interactiveai  # module-level client used for the scoring call below
from interactiveai import observe

@observe(as_type="generation", name="llm-call")
def generate_summary(document):
    response = llm.generate(f"Summarize: {document}")

    interactiveai.score_current_span(
        name="generation_quality",
        value=4.5,
        data_type="NUMERIC",
        comment="Well-structured output with minor formatting issues"
    )

    return response

generate_summary("The document content...")
interactiveai.flush() # Ensures all pending data is sent
```

{% endtab %}
{% endtabs %}

#### Scoring a Session

Score a session by calling `create_score()` with a `session_id`. There is no `score_current_session()` method because sessions span multiple traces and are not tied to a single execution context.

```python
interactiveai.create_score(
    session_id="your-session-id",
    name="conversation_success",
    value=True,
    data_type="BOOLEAN",
    comment="User goal was achieved within the session"
)

interactiveai.flush() # Ensures all pending data is sent
```

#### Adding Metadata to Scores

Include additional context with your scores using the `metadata` parameter:

```python
interactiveai.create_score(
    trace_id="your-trace-id",
    name="helpfulness",
    value=4,
    data_type="NUMERIC",
    comment="Provided actionable recommendations",
    metadata={
        "evaluator": "gpt-4",
        "evaluation_prompt_version": "v2.1",
        "criteria": ["actionable", "specific", "relevant"]
    }
)

interactiveai.flush() # Ensures all pending data is sent
```

***

### Viewing Scores

Scores are accessible from multiple locations throughout the platform depending on your workflow.

#### Scores Page

The dedicated **Scores** page under **Govern** provides a centralized view of all scores across your project. Use this view to filter, search, and analyze scores independently of their parent traces or sessions.

<div data-with-frame="true"><figure><img src="https://708770081-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F1ICwJbq7EJdn5kBgXnQu%2Fuploads%2F7ju7xnaEJWziFLRBCiDJ%2FScreenshot%202026-03-11%20at%2014.03.00.png?alt=media&#x26;token=150a872f-c68f-4c9c-9408-5d1a482cece8" alt=""><figcaption></figcaption></figure></div>

#### Trace Detail View

When inspecting a specific trace, navigate to the **Scores** tab in the detail panel to see all scores attached to that trace and its observations. This view displays the score name, value, and any associated comments.

<div data-with-frame="true"><figure><img src="https://708770081-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F1ICwJbq7EJdn5kBgXnQu%2Fuploads%2Ffv95yP45jH39TCG9YFke%2FClipboard-20260311-130531-370.gif?alt=media&#x26;token=aff6e6cc-7d82-4d0e-a071-bce1e6d155cb" alt=""><figcaption></figcaption></figure></div>

#### Table Views (Traces, Sessions, Datasets)

In the Traces, Sessions, and Datasets tables, you can **add score columns** for at-a-glance comparison across multiple items:

1. Click the columns icon in the top-right corner of the table
2. In the Columns panel, scroll to the **Scores** section
3. Select the scores you want to display as columns

<div data-with-frame="true"><figure><img src="https://708770081-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F1ICwJbq7EJdn5kBgXnQu%2Fuploads%2FfdcFh5E1eSyzRFLWtJf5%2Fimage.png?alt=media&#x26;token=0e234cd6-92b8-4189-a31c-eda79f0e3d06" alt=""><figcaption></figcaption></figure></div>

This approach is useful for sorting and filtering large sets of traces or sessions by specific quality metrics.

***

### Properties of a Score <a href="#properties-of-a-score" id="properties-of-a-score"></a>

| Property                  | Description                                                                                      |
| ------------------------- | ------------------------------------------------------------------------------------------------ |
| **Trace Name**            | Name of the associated trace                                                                     |
| **Trace**                 | Trace ID of the trace this score belongs to                                                      |
| **Environment**           | Deployment context like `production`, `default`, or `development`                                |
| **User**                  | End-user associated with the scored trace                                                        |
| **Timestamp**             | Creation time of the score                                                                       |
| **Source**                | Origin of the score: `EVAL` (automated evaluation) or `ANNOTATION` (human annotator)             |
| **Name**                  | Identifier for the score type (e.g., `"correctness"`, `"helpfulness"`, `"toxicity"`)             |
| **Data Type**             | One of `NUMERIC`, `BOOLEAN`, or `CATEGORICAL`                                                    |
| **Value**                 | The raw score value. Numeric for numeric/boolean scores; string for categorical                  |
| **Metadata**              | Free-form JSON for extra context                                                                 |
| **Comment**               | Free-text notes, such as evaluator feedback or reasoning                                         |
| **Author**                | User or system that created the score                                                            |
| **Eval Configuration ID** | References a predefined score configuration that defines the schema, type, range, and categories |
| **Trace Tags**            | Tags inherited from the associated trace                                                         |
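
To make these properties concrete, a single score record might look roughly like the following. This is an illustrative sketch only; the exact field names and shapes returned by the API may differ.

```python
# Illustrative shape of a score record; field names approximate the
# properties above and are not the exact API schema.
example_score = {
    "trace_name": "handle_request",
    "trace_id": "your-trace-id",
    "environment": "production",
    "user": "user-123",
    "timestamp": "2026-03-11T14:03:00Z",
    "source": "EVAL",
    "name": "helpfulness",
    "data_type": "NUMERIC",
    "value": 4,
    "metadata": {"evaluator": "gpt-4", "evaluation_prompt_version": "v2.1"},
    "comment": "Provided actionable recommendations",
    "author": "automated-eval",
    "eval_configuration_id": "cfg-123",
    "trace_tags": ["support"],
}
```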
