# Evaluators

Evaluators automate quality assessment using **LLM-as-a-Judge**, a method where a language model **scores the outputs** of your AI application. The judge model receives trace data, applies your chosen evaluation criteria, and returns scores along with chain-of-thought reasoning.

### Why Evaluators Matter

Manual evaluation doesn't scale. Reviewing thousands of outputs by hand is slow, expensive, and inconsistent. LLM-as-a-Judge offers a practical alternative:

* **Scalability:** Evaluate thousands of outputs in minutes rather than days.
* **Nuance:** Captures subjective dimensions like helpfulness, relevance, and tone better than rule-based metrics.
* **Repeatability:** With a fixed rubric, the same evaluation prompt produces consistent scores across runs.

{% hint style="warning" %}
LLM evaluators can inherit biases, favor verbose responses, or produce inconsistent scores. Periodically validate automated scores against human annotation to ensure alignment. See [Annotations](https://docs.interactive.ai/improve/annotations) for establishing human baselines and calibration.
{% endhint %}
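One way to run such a calibration check is a simple agreement rate between judge and human scores. The sketch below is plain illustrative Python, not part of the InteractiveAI SDK; the score lists and tolerance are made-up examples:

```python
def agreement_rate(judge_scores, human_scores, tolerance=0.1):
    """Fraction of items where the judge score is within `tolerance`
    of the human annotation for the same trace."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("score lists must be non-empty and equal length")
    matches = sum(
        1 for j, h in zip(judge_scores, human_scores) if abs(j - h) <= tolerance
    )
    return matches / len(judge_scores)

# Example: judge vs. human scores on the same five traces
judge = [0.9, 0.8, 0.3, 1.0, 0.6]
human = [1.0, 0.8, 0.5, 1.0, 0.7]
rate = agreement_rate(judge, human)  # 4 of 5 within 0.1 -> 0.8
```

If the rate drops over time, revisit the evaluator prompt or re-calibrate against fresh human annotations.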

***

### Running an Evaluation

Once you have an evaluator (built-in or custom), configure where and how it should run. To run an evaluator:

{% tabs %}
{% tab title="Via InteractiveAI Platform" %}

1. Navigate to **Running Evaluators**
2. Click **Set Up Evaluator** on the top right corner
3. Select your evaluator for this run
4. Configure the run:

* **Evaluator Runs On**: Choose what data to evaluate.
  * **New traces**: Run on traces as they arrive
  * **Existing traces**: Run once on historical traces&#x20;
* **Target Data**: Select the data source.
  * **Live Tracing Data**: Production traces from your application
  * **Experiment Runs**: Outputs from dataset experiments
* **Target Filter**: Narrow evaluation to specific subsets using filters like trace name, tags, environment, or dataset (dataset filtering is available only if you selected Experiment Runs in the previous step).
* **Variable Mapping**: Map each prompt variable to trace data.
  * **Object**: Select Trace or a specific observation
  * **Object Variable**: Choose Input, Output, or Metadata
  * **JsonPath**: Optionally extract a specific field from the JSON

5. Review the **Preview Sample Matched Traces** to verify your filters match the expected data
6. Click **Execute**
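To illustrate what the JsonPath option in the Variable Mapping step does, the sketch below mimics the extraction in plain Python. The trace structure and path are hypothetical examples; the platform performs this lookup for you:

```python
import json

# A hypothetical observation output as it might be stored on a trace
trace_output = json.loads("""
{
  "choices": [
    {"message": {"role": "assistant", "content": "Paris is the capital of France."}}
  ]
}
""")

# The JsonPath $.choices[0].message.content corresponds to this lookup,
# so the prompt variable receives only the generated text, not the full JSON
generation = trace_output["choices"][0]["message"]["content"]
```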
   {% endtab %}

{% tab title="Via InteractiveAI SDK" %}
Use `run_batched_evaluation()` to evaluate existing traces or observations programmatically at scale. The method fetches items from InteractiveAI using filters, transforms them with a mapper function, and runs evaluators on each item.

```python
from interactiveai import EvaluatorInputs, Evaluation

# Define a mapper to extract fields from traces
def trace_mapper(*, item, **kwargs):
    return EvaluatorInputs(
        input=item.input,
        output=item.output,
        expected_output=None,
        metadata={"trace_id": item.id}
    )

# Define an evaluator
def length_evaluator(*, input, output, expected_output, metadata, **kwargs):
    return Evaluation(
        name="output_length",
        value=len(output) if output else 0
    )

# Run batch evaluation
result = interactiveai.run_batched_evaluation(
    scope="traces",
    mapper=trace_mapper,
    evaluators=[length_evaluator],
    filter='{"tags": ["production"]}',
    max_items=1000,
    verbose=True
)

print(f"Processed {result.total_items_processed} traces")
print(f"Created {result.total_scores_created} scores")
```

{% endtab %}
{% endtabs %}
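If you write your own LLM-judge evaluator with the SDK, the core work is turning the judge model's free-text response into a structured score plus reasoning. The response format and parsing rules below are assumptions for illustration, not an InteractiveAI contract:

```python
import re

def parse_judge_response(text):
    """Extract a numeric score and reasoning from a judge response shaped like:
    'Score: 0.8\\nReasoning: Mostly correct, but one claim is unsupported.'"""
    score_match = re.search(r"Score:\s*([0-9]*\.?[0-9]+)", text)
    reason_match = re.search(r"Reasoning:\s*(.+)", text, re.DOTALL)
    if not score_match:
        return None, None  # judge did not follow the format; skip scoring
    score = float(score_match.group(1))
    reasoning = reason_match.group(1).strip() if reason_match else ""
    return score, reasoning

score, reasoning = parse_judge_response(
    "Score: 0.8\nReasoning: Mostly correct, but one claim is unsupported."
)
```

Returning `None` when the format is not followed lets you skip scoring that item rather than recording a garbage value.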

{% hint style="info" %}
For the full `run_batched_evaluation()` API reference including resume capability, composite evaluators, and observation-level evaluation, see the [SDK Documentation](https://app.gitbook.com/s/jHEEbkpMbUW2x51XS8Ez/experiments#run_batched_evaluation-source).
{% endhint %}

***

### Evaluator Library View

Navigate to **Improve → Evaluators** and select **Evaluator Library** in the top right to see all available evaluators in your project. The library displays each evaluator's name, maintainer (InteractiveAI or User), last edit date, usage count, version information, and ID. Here you will find two main types of evaluators: InteractiveAI's **Out-of-the-Box Evaluators** and your **Custom Evaluators.**

<div data-with-frame="true"><figure><img src="https://708770081-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F1ICwJbq7EJdn5kBgXnQu%2Fuploads%2FThFIqcb7Nh2lBkAW0uAj%2Fimage.png?alt=media&#x26;token=06dc8b42-f0df-47e5-82e6-17dfd7d496fb" alt=""><figcaption></figcaption></figure></div>

#### **Out-of-the-Box Evaluators**

InteractiveAI provides a catalog of ready-to-use evaluators maintained by the platform. These capture best-practice evaluation prompts for common quality dimensions such as:

| Evaluator               | What it measures                                    |
| ----------------------- | --------------------------------------------------- |
| **Conciseness**         | Brevity without unnecessary content                 |
| **Context-correctness** | Whether retrieved context is accurate               |
| **Context-relevance**   | Whether retrieved context fits the query            |
| **Correctness**         | Factual accuracy compared to ground truth           |
| **Hallucination**       | Fabricated or unsupported information in the output |
| **Helpfulness**         | Whether the response assists the user effectively   |
| **Relevance**           | How directly the output addresses the query         |
| **Toxicity**            | Harmful, offensive, or inappropriate language       |

Click any evaluator to inspect its prompt, scoring criteria, and configuration in detail.

#### **Custom Evaluators**

When the built-in evaluators don't fit your needs, you can create your own. Custom evaluators appear in the library with "User" as the maintainer.

**To create a custom evaluator:**

1. Navigate to **Evaluator Library**
2. Click **+ Set Up Evaluator**
3. Configure the evaluator:
   * **Name**: Identifier for this evaluator
   * **Model**: Toggle to use the default evaluation model or select a specific one
   * **Score reasoning prompt**: Instructions for how the judge should explain its reasoning
   * **Score range prompt**: Description of the scoring scale (e.g., "Score between 0 and 1")
   * **Evaluation prompt**: The full prompt template with `{{variables}}` placeholders
4. Click **Save**

<div data-with-frame="true"><figure><img src="https://708770081-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F1ICwJbq7EJdn5kBgXnQu%2Fuploads%2FWrIbt6176RtFLUupGJqi%2Fimage.png?alt=media&#x26;token=cbc1f168-7b8f-4bc2-a474-b09bdca85dc0" alt=""><figcaption></figcaption></figure></div>

Your prompt can use variables like `{{query}}`, `{{generation}}`, and `{{ground_truth}}` that will be mapped to trace data when the evaluator runs.&#x20;
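For example, a custom correctness-style evaluation prompt might look like the following. This is an illustrative template, not one of the built-in prompts:

```
You are an expert evaluator. Compare the generated answer against the
ground truth and judge its factual accuracy.

Question: {{query}}

Generated answer: {{generation}}

Ground truth: {{ground_truth}}

Score 1 if the generated answer is factually consistent with the ground
truth, 0 if it contradicts it, and a value in between for partial matches.
Explain your reasoning before giving the score.
```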

***

### Running Evaluators View

This view shows all your evaluator runs with their ID, score name, creation date, current status (active or inactive), and result counts. You'll also see links to the referenced evaluator configuration, the target that was evaluated, and which filters were applied.&#x20;

<div data-with-frame="true"><figure><img src="https://708770081-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F1ICwJbq7EJdn5kBgXnQu%2Fuploads%2FCSPMeA5WYergSSkAb6cE%2Fimage.png?alt=media&#x26;token=51102829-cdb0-4d21-b805-f069aa217647" alt=""><figcaption></figcaption></figure></div>

When you need to dig deeper, click **Open** in the **Logs** column to open the detailed execution log. This shows every trace that was evaluated along with its score value, the model's reasoning comment, and the final status. If something looks off or you want to understand exactly how the judge reached a particular score, click the trace link to inspect the full evaluation.&#x20;

<div data-with-frame="true"><figure><img src="https://708770081-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F1ICwJbq7EJdn5kBgXnQu%2Fuploads%2FvdDK7EnwgfaYXrrRIdTi%2Fimage.png?alt=media&#x26;token=a0ee2280-0f44-490f-a8ed-194c05a7e94c" alt=""><figcaption></figcaption></figure></div>

Since every LLM-as-a-Judge execution creates its own trace, you get complete visibility into what happened: the exact prompt sent to the judge, how the model responded, whether variable mapping worked correctly, and how many tokens were consumed.
