Evaluators

Evaluators automate quality assessment using LLM-as-a-Judge, a method where a language model scores the outputs of your AI application. The judge model receives trace data, applies your chosen evaluation criteria, and returns scores along with chain-of-thought reasoning.

Why Evaluators Matter

Manual evaluation doesn't scale. Reviewing thousands of outputs by hand is slow, expensive, and inconsistent. LLM-as-a-Judge offers a practical alternative:

  • Scalability: Evaluate thousands of outputs in minutes rather than days.

  • Nuance: Captures subjective dimensions like helpfulness, relevance, and tone better than rule-based metrics.

  • Repeatability: With a fixed rubric, the same evaluation prompt produces consistent scores across runs.


Evaluator Library View

Navigate to Improvement → Evaluators and select Evaluator Library in the top right to see all available evaluators in your project. The library displays each evaluator's name, maintainer (InteractiveAI or User), last edit date, usage count, version information, and ID. There are two main types of evaluators: InteractiveAI's Out-of-the-Box Evaluators and your Custom Evaluators.

Out-of-the-Box Evaluators

InteractiveAI provides a catalog of ready-to-use evaluators maintained by the platform. These capture best-practice evaluation prompts for common quality dimensions such as:

| Evaluator | What it measures |
| --- | --- |
| Conciseness | Brevity without unnecessary content |
| Context-correctness | Whether retrieved context is accurate |
| Context-relevance | Whether retrieved context fits the query |
| Correctness | Factual accuracy compared to ground truth |
| Hallucination | Fabricated or unsupported information in the output |
| Helpfulness | Whether the response assists the user effectively |
| Relevance | How directly the output addresses the query |
| Toxicity | Harmful, offensive, or inappropriate language |

Click any evaluator to inspect its prompt, scoring criteria, and configuration in depth.

Custom Evaluators

When the built-in evaluators don't fit your needs, you can create your own. Custom evaluators appear in the library with "User" as the maintainer.

To create a custom evaluator:

  1. Navigate to Evaluator Library

  2. Click + Set Up Evaluator

  3. Configure the evaluator:

    • Name: Identifier for this evaluator

    • Model: Toggle to use the default evaluation model or select a specific one

    • Score reasoning prompt: Instructions for how the judge should explain its reasoning

    • Score range prompt: Description of the scoring scale (e.g., "Score between 0 and 1")

    • Evaluation prompt: The full prompt template with {{variables}} placeholders

  4. Click Save

Your prompt can use variables like {{query}}, {{generation}}, and {{ground_truth}} that will be mapped to trace data when the evaluator runs.
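The substitution works like ordinary template rendering: each `{{variable}}` placeholder is replaced with the mapped trace value before the prompt is sent to the judge. The sketch below illustrates the idea; the `render_prompt` helper and the sample template are illustrative assumptions, not the platform's actual implementation.

```python
import re

def render_prompt(template: str, variables: dict) -> str:
    """Replace {{name}} placeholders with values from the variables dict."""
    def substitute(match):
        name = match.group(1).strip()
        if name not in variables:
            raise KeyError(f"No mapping for prompt variable: {name}")
        return str(variables[name])
    return re.sub(r"\{\{(.*?)\}\}", substitute, template)

# Hypothetical correctness-style evaluation prompt
template = (
    "Query: {{query}}\n"
    "Model answer: {{generation}}\n"
    "Reference answer: {{ground_truth}}\n"
    "Score between 0 and 1 how well the answer matches the reference."
)

prompt = render_prompt(template, {
    "query": "What is the capital of France?",
    "generation": "Paris",
    "ground_truth": "Paris",
})
print(prompt)
```

If a variable in the template has no mapping, the render fails loudly, which mirrors why the variable-mapping step matters when you configure a run.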


Running Evaluators View

This view shows all your evaluator runs with their ID, score name, creation date, current status (active or inactive), and result counts. You'll also see links to the referenced evaluator configuration, the target that was evaluated, and which filters were applied.

When you need to dig deeper, click Open under the Logs Column to open the detailed execution log. This shows every trace that was evaluated along with its score value, the model's reasoning comment, and the final status. If something looks off or you want to understand exactly how the judge reached a particular score, click the trace link to inspect the full evaluation.

Since every LLM-as-a-Judge execution creates its own trace, you get complete visibility into what happened: the exact prompt sent to the judge, how the model responded, whether variable mapping worked correctly, and how many tokens were consumed.
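Conceptually, each judge execution boils down to sending the rendered prompt to the model and parsing a structured reply into a score and a reasoning comment. The sketch below assumes the judge was asked to reply as JSON of the form `{"score": ..., "reasoning": ...}`; that response contract and the stand-in reply are assumptions for illustration, not a documented InteractiveAI format.

```python
import json

def parse_judge_response(raw: str) -> tuple[float, str]:
    """Parse a judge model's JSON reply into (score, reasoning).

    Assumes the evaluation prompt instructed the judge to answer as
    {"score": <float>, "reasoning": "<text>"} -- an assumed convention.
    """
    data = json.loads(raw)
    score = float(data["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return score, data["reasoning"]

# Stand-in for an actual judge model reply
raw_reply = '{"score": 0.9, "reasoning": "The answer matches the reference."}'
score, reasoning = parse_judge_response(raw_reply)
print(score, reasoning)
```

Because the whole exchange is itself traced, a malformed reply or out-of-range score is easy to diagnose from the execution log.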

Run an Evaluation

Once you have an evaluator (built-in or custom), configure where and how it should run.

To run an evaluator:

  1. Navigate to Running Evaluators

  2. Click Set Up Evaluator on the top right corner

  3. Select your evaluator for this run

  4. Configure the run:

  • Evaluator Runs On: Choose what data to evaluate.

    • New traces: Run on traces as they arrive

    • Existing traces: Run once on historical traces

  • Target Data: Select the data source.

    • Live Tracing Data: Production traces from your application

    • Experiment Runs: Outputs from dataset experiments

  • Target Filter: Narrow evaluation to specific subsets using filters like trace name, tags, environment, or dataset (dataset filtering is only available if you selected Experiment Runs in the previous step).

  • Variable Mapping: Map each prompt variable to trace data.

    • Object: Select Trace or a specific observation

    • Object Variable: Choose Input, Output, or Metadata

    • JsonPath: Optionally extract a specific field from the JSON

  5. Review the Preview Sample Matched Traces to verify your filters match the expected data

  6. Click Execute
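The JsonPath option in variable mapping lets you pull one field out of a nested trace payload instead of passing the whole object to the judge. The helper below sketches the idea with a simplified dotted-path syntax; the `extract` function and the sample trace are illustrative, and the platform's JsonPath support may use full JSONPath syntax (e.g. `$.output.answer`).

```python
def extract(obj, path: str):
    """Resolve a simplified dotted path like 'output.answer' against
    nested dicts and lists. Illustrative only."""
    current = obj
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]  # numeric parts index into lists
        else:
            current = current[part]
    return current

# Hypothetical trace payload
trace = {
    "input": {"query": "What is the capital of France?"},
    "output": {"answer": "Paris", "sources": ["wiki/France"]},
}

print(extract(trace, "input.query"))    # could be mapped to {{query}}
print(extract(trace, "output.answer"))  # could be mapped to {{generation}}
```

Previewing sample matched traces before executing is the quickest way to confirm that paths like these resolve against your real trace shape.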
